
Given quantities of that magnitude, and the relative simplicity of the task (breaking the stream into a sequence of numerics), I'd say it's worthwhile to write an application in C and compile it.

It would be a short and easy program to write, esp. as a stdin-stdout filter: it's just a while loop that reads a nice size char buffer (say, a few MB at a time), and steps through the buffer one character at a time, accumulating consecutive digit characters, and outputting the string of digits every time you encounter a non-digit character. It wouldn't be more than 20 lines of C code, if that, and you'll save a lot of run-time.

I suppose there must be more to your overall process than just splitting into digit strings; you could still do that extra part of your process in perl, but have the perl script read from the output of the C program. (But again, given the quantity of data, if the other stuff can be done in C without too much trouble, I'd do that.)

UPDATE: Okay, I admit I was wrong about how many lines of C it would take. This C program is 30 lines (not counting the 4 blank lines added for legibility):

#include <stdio.h>
#define BUFSIZE 5242880

int main( argc, argv )
int argc;
char *argv[];
{
    char buffer[BUFSIZE], digitstr[64];
    char *bufptr, *numptr;
    int nread, i;

    numptr = digitstr;
    while (( nread = fread( buffer, 1, BUFSIZE, stdin )) > 0 ) {
        bufptr = buffer;
        i = 0;
        while ( i < nread ) {
            if ( *bufptr >= 0x30 && *bufptr <= 0x39 )
                *numptr++ = *bufptr;
            else if ( numptr > digitstr ) {
                *numptr = 0;
                printf( "%s\n", digitstr );
                numptr = digitstr;
            }
            bufptr++;
            i++;
        }
    }

    /* update: need this last bit in case last char in the stream is a digit */
    if ( numptr > digitstr ) {
        *numptr = 0;
        printf( "%s\n", digitstr );
    }
}

(2nd update: added four more lines at the end to handle the case where the last char in the stream happens to be a digit.)
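
Two things in the program above may bite on some inputs or platforms: digitstr holds at most 63 digits, and a 5 MB array on the stack can exceed the default stack size on some systems. Here is a minimal sketch of a variant that avoids both, assuming it is acceptable to write each digit as soon as it is seen instead of collecting the run first. The names split_digits and CHUNK and the 1 MB read size are my own choices for illustration, not part of the program above.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 << 20)   /* read size; 1 MB is an arbitrary choice */

/* Copy every run of decimal digits from `in` to `out`, one run per line.
   Returns 0 on success, -1 if the read buffer cannot be allocated. */
int split_digits(FILE *in, FILE *out)
{
    unsigned char *buf = malloc(CHUNK);  /* heap allocation, not stack */
    size_t nread, i;
    int in_run = 0;                      /* are we inside a digit run? */

    if (buf == NULL)
        return -1;

    while ((nread = fread(buf, 1, CHUNK, in)) > 0) {
        for (i = 0; i < nread; i++) {
            if (isdigit(buf[i])) {
                putc(buf[i], out);       /* emit digits as they arrive */
                in_run = 1;
            } else if (in_run) {
                putc('\n', out);         /* first non-digit ends the run */
                in_run = 0;
            }
        }
    }
    if (in_run)                          /* stream ended on a digit */
        putc('\n', out);

    free(buf);
    return 0;
}
```

As a stdin-stdout filter, main() would only need to call split_digits(stdin, stdout). Because in_run is kept across reads, a run of digits that straddles two chunks still comes out as a single line.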

Re^4: Reading a huge input line in parts
by kroach (Pilgrim) on May 05, 2015 at 20:32 UTC
    That's actually a very good idea; I hadn't thought about this approach.

      You can also use flex to make a scanner with very little fuss. For example:

%{
void process(char *tok);
%}
%option noyywrap
%%
[0-9]+      process(yytext);
[ \t\n]+    /* ignore */
.           /* printf("Bad input character: %s\n", yytext); */
%%
void process(char *tok)
{
    printf("%d\n", atoi(tok));
}

int main(int argc, char **argv)
{
    yyin = stdin;
    if (argc > 1)
        yyin = fopen(argv[1], "r");
    return yylex();
}

      Just run it through flex and compile the generated lex.yy.c.
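
One caveat if the digit strings can be long: atoi() has undefined behavior when the value does not fit in an int. A possible replacement for process() is sketched below, assuming unsigned long long is wide enough for the values you care about. The name process_checked and its return convention are hypothetical, made up for this example.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical replacement for process(): prints the token as a number.
   Returns 0 on success, -1 if the token is not a plain decimal number
   or does not fit in unsigned long long. */
int process_checked(const char *tok, FILE *out)
{
    char *end;
    unsigned long long value;

    errno = 0;
    value = strtoull(tok, &end, 10);
    if (errno == ERANGE || end == tok || *end != '\0')
        return -1;                  /* overflow or not all digits */
    fprintf(out, "%llu\n", value);
    return 0;
}
```

In the scanner, the rule would then call something like process_checked(yytext, stdout) and decide what to do on -1, for example printing the raw token text instead.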