Re^5: use regular expressions across multiple lines from a very large input file

One will always need to have two joined blocks of size n in memory. Whenever there are no more matches that start in the front block, one has to delete it, treat the rear block as the new front, and read a new rear block from disk.

        loaded Blocks 
|------[++++++|++++++]------|------|---| file
   A      B      C      D      E      F
             <----->
              match
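
The scheme in the diagram could be sketched roughly as follows. This is only an illustration in Python rather than Perl, under stated assumptions: the name `scan_in_blocks` and its parameters are made up for this example, the input is any object with a `read(n)` method, and no match is longer than one block. A greedy pattern that runs into the end of the window can still be cut short, and lookbehinds lose their context at the start of each new front block.

```python
import io
import re

def scan_in_blocks(stream, pattern, block_size):
    """Yield (absolute_offset, text) for each match, holding two blocks at a time.

    Hypothetical helper: assumes no match is longer than block_size.
    """
    regex = re.compile(pattern, re.DOTALL)
    front = stream.read(block_size)
    offset = 0  # absolute position of front[0] in the input
    pos = 0     # where searching resumes, relative to the window start
    while front:
        rear = stream.read(block_size)
        window = front + rear  # the join is a copy, as the post suspects
        while True:
            m = regex.search(window, pos)
            # Matches starting in the rear block wait until it becomes the front.
            if m is None or m.start() >= len(front):
                break
            yield offset + m.start(), m.group()
            # Step past zero-length matches so the loop always advances.
            pos = m.end() if m.end() > m.start() else m.end() + 1
        # Drop the front block; the rear block becomes the new front.
        pos = max(pos - len(front), 0)
        offset += len(front)
        front = rear

# Usage: the first record straddles the boundary between blocks of 16 characters.
data = "aaa\nSTART\nfoo\nEND\nbbb\nSTART\nbar\nEND\n"
hits = list(scan_in_blocks(io.StringIO(data), r"START\n.*?\nEND", 16))
```

Note that `front + rear` briefly needs room for the two blocks plus their joined copy, which is the memory question raised below.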

Actually I'm not sure if joining two strings can be done without needing twice as much memory!

Anyway I don't think going to extremes is a good idea...

Cheers Rolf

Re^6: use regular expressions across multiple lines from a very large input file by CountZero (Bishop) on Dec 06, 2010 at 23:27 UTC
OK. I may not have expressed myself clearly enough. My idea was to run the program with a huge block size, which will cause an out-of-memory error, and then reduce the block size until it no longer errors out. That would automatically take care of the joining of strings and its additional memory use.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
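
The shrink-until-it-fits idea might look something like the sketch below. It is written in Python for illustration and rests on assumptions: `largest_working_block_size`, `run` and `min_size` are hypothetical names, and the out-of-memory condition is assumed to surface as a catchable exception (`MemoryError` here). If the process is simply killed instead, the same loop would have to live in an outer wrapper that re-runs the program.

```python
def largest_working_block_size(run, start_size, min_size=1024):
    """Halve the block size until run(size) stops raising MemoryError.

    `run` is a hypothetical callable that processes the whole file
    with the given block size.
    """
    size = start_size
    while size >= min_size:
        try:
            run(size)
            return size      # first size that completed without running out
        except MemoryError:
            size //= 2       # too big: try half of it
    raise MemoryError("no block size >= %d worked" % min_size)

# Usage with a stand-in for the real job: pretend anything above 5000 fails.
def fake_run(size):
    if size > 5000:
        raise MemoryError

chosen = largest_working_block_size(fake_run, 65536)
```

The size found this way only holds for the memory available during that run, so it can differ from one run to the next.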
Re^7: use regular expressions across multiple lines from a very large input file by LanX (Saint) on Dec 06, 2010 at 23:55 UTC
OK, but what for? Beyond a certain block size the speed improvement is negligible (overhead of joining, restarting the regex, and so on), and at the same time you have to worry about other processes which might reduce the available memory. I prefer algorithms with a predictable amount of used resources.

Cheers Rolf