fpscolin has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
I've encountered a problem while trying to split a string. The string is a slurp of a large HTML file (~3 million characters, no images), which needs to be split into two smaller strings so they can be passed to a PDF converter. At the moment, this is what I have:

if (length($html[$i]) > 1000000) {   ## $html[$i] holds the current string of html
    my @bigHTML = ### Split here #########;
    for (my $z = 0; $z <= $#bigHTML; $z++) {
        open (my $file, '>', "giant_${i}_$z.htm") or die $!;
        print $file $bigHTML[$z];
        close $file;
        push(@lists, "giant_${i}_$z.htm");
    }
}
else {
    ### Proceed normally

Ideally I would replace the ### Split here #### with $html[$i] =~ /(.{500000,}?<\/html>)/gs; (splitting every 500,000 characters), but this does not work since Perl's {n,m} regex quantifiers are limited to about 32766 (iirc). Splitting every 32,000 characters results in far too many smaller strings.
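For reference, the failure is a fatal pattern-compile error, easy to reproduce (minimal sketch):

# The pattern itself fails to compile because of the oversized quantifier:
"x" =~ /.{500000,}?<\/html>/s;
# Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE ...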

Any suggestions as to how to get around this?

Thanks, Colin

edited for formatting

Re: Split very big string in half
by afoken (Chancellor) on Apr 15, 2015 at 19:13 UTC
    needs to be split into two smaller strings

    Your code creates more than two smaller strings/files. What do you want? Two equal-sized strings/files or strings/files with a maximum length?

    strings [...] to be passed to a PDF converter

    Blindly splitting HTML at arbitrary offsets will most likely create invalid HTML. Depending on how it is rendered, you will lose some information. In the worst case, the converter may simply refuse to render it at all.

    Back to your posted problem, assuming strings/files with a size limit, maybe more than two parts: Have you tried substr? Something like this (untested):

    if (length($html[$i]) > 1000000) {
        my $z = 0;
        while ($html[$i] ne '') {
            open (my $file, '>', "giant_${i}_$z.htm") or die $!;
            print $file substr($html[$i], 0, 500000, '');
            close $file;
            push(@lists, "giant_${i}_$z.htm");
            $z++;
        }
    }
    else {
        # ...

    Note that the above code is destructive due to using the four-argument substr, i.e. $html[$i] will be empty after the code has run. Non-destructive code can use the $z counter, like this:

    if (length($html[$i]) > 1000000) {
        my $z = 0;
        while ($z * 500000 < length($html[$i])) {
            open (my $file, '>', "giant_${i}_$z.htm") or die $!;
            print $file substr($html[$i], $z * 500000, 500000);
            close $file;
            push(@lists, "giant_${i}_$z.htm");
            $z++;
        }
    }
    else {
        # ...

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      You are correct, my apologies. I should have clarified: this massive HTML string is actually a concatenation of hundreds of smaller HTML files. I realize this sounds stupid, but I am combining hundreds of files into a few (about 10) and then checking whether any of these is too big. This is necessary for reasons I won't go into too much (table of contents generation, internal pdf linkage...). Ideally I would split the string at the first closing </html> tag past 500,000 characters. This is why I was looking to use a regex /(.{500000,}?<\/html>)/gs

        "...I am combining hundreds of files into few (about 10) and then checking if any of these is too big..."

        Perhaps you should rethink your concept?

        I'm unsure if my idea fits your needs, but couldn't you convert each file to PDF first and then concatenate the PDFs, perhaps step by step?

        As far as I remember, CAM::PDF provides this feature.
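        An untested sketch of that merge step, assuming the individual PDFs have already been produced by the converter (the file names here are made up):

        use strict;
        use warnings;
        use CAM::PDF;

        my @pdf_files = glob 'part_*.pdf';    # assumed naming scheme
        die "no PDFs to merge\n" unless @pdf_files;

        # Start from the first document and append the rest, one by one.
        my $merged = CAM::PDF->new(shift @pdf_files) or die $CAM::PDF::errstr;
        for my $file (@pdf_files) {
            my $next = CAM::PDF->new($file) or die $CAM::PDF::errstr;
            $merged->appendPDF($next);        # appends all pages of $next
        }
        $merged->output('combined.pdf');      # write the merged PDF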

        Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        karlgoethebier is right. Your concept is broken.

        You can't concatenate several HTML documents to make one big document. Of course your computer lets you do exactly that, and while the result may be rendered by some forgiving browser, it is really junk.

        Split the big file into the original documents, and pass each document separately to the PDF converter. Splitting should not be that hard, assuming the original documents are reasonably clean (a sketch follows after the notes below):

        1. Open the big file for reading.
        2. Open an output file.
        3. Read a line from the big file.
        4. If the line contains something that looks like the start of an HTML document ("<!DOCTYPE", "<HTML", "<?xml"), write everything up to the match to the current output file, then close it, create a new file, and write the match and everything following it to the new file.
        5. Else, write the line to the current output file.
        6. Repeat from step 3 until eof.
        7. Close input and output files.

        You may need to add some special cases:

        • The first output file will usually be empty if there is no junk at the start of the big file.
        • All of the signatures for the document start may be found in a single document, following each other (a valid XHTML document starts with <?xml version="1.0" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">, whereas HTML5 starts with <!DOCTYPE html><html>). You usually want only the first match.

        A simple trick is to assume that the <?xml and <!DOCTYPE declarations are relatively short, but a complete HTML document needs much more data, at least 500 characters (or something like that). So if tell OUTPUT returns a non-negative number less than 500 when matching a signature, don't create a new output file, but continue to write to the old output file. This also avoids an empty first file.
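        Putting the steps and the tell() trick together, an untested sketch (it assumes each document's opening signature starts a line, so step 4's mid-line split is skipped for brevity; file names are illustrative):

        use strict;
        use warnings;

        open my $in, '<', 'big_file.htm' or die "big_file.htm: $!";

        my $count = 0;
        open my $out, '>', sprintf('doc_%03d.htm', $count) or die $!;

        while (my $line = <$in>) {
            # A document signature starts a new output file -- but only if
            # the current file already holds a plausible amount of HTML, so
            # the <?xml / <!DOCTYPE / <html run at the top of one document
            # does not get split across several files (and so that the
            # first output file is not left empty).
            if ($line =~ /<(?:!DOCTYPE|html|\?xml)/i and tell($out) >= 500) {
                close $out;
                $count++;
                open $out, '>', sprintf('doc_%03d.htm', $count) or die $!;
            }
            print $out $line;
        }
        close $out;
        close $in;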

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Split very big string in half
by ikegami (Patriarch) on Apr 15, 2015 at 19:01 UTC
      While this didn't work, it pointed me in the right direction so thanks :)

      The line that ended up working (splitting on the first </html> after 499,000 chars) is /(?:.{1000}){1,499}.*?<\/html>/gs;
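      Applied like this (untested sketch; the nested quantifier keeps each repetition count under the 32766 cap, and in list context m//g with no capture groups returns the whole matches):

      my @bigHTML = $html[$i] =~ /(?:.{1000}){1,499}.*?<\/html>/gs;
      # Caveat: a trailing chunk shorter than 1000 characters, or any text
      # after the last </html>, will not be captured by this match.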

Re: Split very big string in half
by Laurent_R (Canon) on Apr 15, 2015 at 20:26 UTC
    On the problem as stated, I would tend to prefer substr rather than a regex. But of course you still have to look for an HTML boundary.

    However, since you are reading many small html files into your string, I would think that it should be possible to read the files in a loop, interrupt the process to run the PDF conversion when you reach the size limit, and then proceed with reading the remaining files. Something like this (this is untested pseudo-code, not an actual solution; we don't have enough information for a real solution):

    use File::Slurp;
    # (...)
    my $current_string = "";
    my $current_size   = 0;
    for my $file (@html_files) {
        my $new_file_string = read_file($file);   # File::Slurp function
        my $len = length $new_file_string;
        if ($current_size + $len > $size_limit) {
            convert_to_pdf($current_string);
            $current_string = $new_file_string;
            $current_size   = $len;
        }
        else {
            $current_string .= $new_file_string;
            $current_size   += $len;
        }
    }
    convert_to_pdf($current_string) if $current_string;
    Again, this is just untested pseudo-code to illustrate the idea, not an actual solution. There are a few edge-cases to consider: for example, this is likely to fail if a single html file is larger than $size_limit (I understand from what you said that this should not be the case, but you might have to accept in such a case that the resulting PDF file will be larger than your limit, or maybe raise an exception, whatever is best suited to your actual situation).

    Je suis Charlie.
Re: Split very big string in half
by pvaldes (Chaplain) on Apr 15, 2015 at 19:40 UTC

    Maybe you are trying to solve the wrong problem here. It seems that what you are looking for is a stream, not a string. Slurping a huge file in its entirety when 'tie' is available seems the wrong approach to me: you could run out of memory.

    How to do this depends on your converter. File::Stream or Tie::File could be useful. The idea is to write your big file to a buffer and then feed your converter continuously in smaller chunks. (If you are using bash, also take a look at xargs.)
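    A hypothetical illustration of the buffering idea, piping the file to a converter's stdin in fixed-size chunks ("html2pdf" is a made-up command; the real invocation depends on your converter):

    use strict;
    use warnings;

    open my $in,  '<',  'giant.htm'  or die "giant.htm: $!";
    open my $out, '|-', 'html2pdf -' or die "html2pdf: $!";

    # Feed the converter 64 KB at a time instead of slurping the file.
    my $chunk;
    while (read($in, $chunk, 65_536)) {
        print $out $chunk;
    }
    close $out or die $!;
    close $in;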

Re: Split very big string in half
by Anonymous Monk on Apr 15, 2015 at 19:55 UTC

    maybe something with index() and substr()

    my $splitpoint = index $html[$i], '</html>', 500_000;
    if( $splitpoint >= 0 ) {
        my $fragment = substr $html[$i], 0, $splitpoint + length '</html>', '';
        ...
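    One way that idea might be wrapped in a loop (untested sketch; the four-argument substr removes each fragment from the front of the string as it goes):

    my @bigHTML;
    while (length $html[$i] > 500_000) {
        my $splitpoint = index $html[$i], '</html>', 500_000;
        last if $splitpoint < 0;    # no closing tag past the limit
        push @bigHTML, substr $html[$i], 0, $splitpoint + length('</html>'), '';
    }
    # Whatever remains is the final, smaller piece.
    push @bigHTML, $html[$i] if length $html[$i];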
Re: Split very big string in half
by Anonymous Monk on Apr 15, 2015 at 20:54 UTC
    # read through big string with changed end-of-line :)
    # untested
    my @bigHTML;
    do {
        local $/ = '</html>';   # or "</html>\n" maybe
        open my $fh, '<', \$html[$i] or die "...";
        my $smallstring = '';
        while (<$fh>) {
            $smallstring .= $_;
            if (length $smallstring > 500_000) {
                push @bigHTML, $smallstring;
                $smallstring = '';
            }
        }
        length $smallstring and push @bigHTML, $smallstring;   # leftover
    };

      Use the above method to read through the big file instead of slurping it.

Re: Split very big string in half
by sundialsvc4 (Abbot) on Apr 16, 2015 at 01:39 UTC

    I would very quickly agree with the following sentiment: "So, you have a ~3 million character file. (So what?) Why do you feel the need to slurp it?"

    The short answer, I think, is: "you don't." :-) Simply treat it as the file that it naturally is. seek() to the midpoint less, say, 10,000 characters, and read 10,000 characters. Look within this string for the end-tag that you seek. Once you find it, you can effortlessly calculate the file offset where it appears, and that's the split point that you're looking for. Exactly the same memory-frugal logic could be applied to a file of any size. A rough sketch of the idea:
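    (Untested, and the file name is illustrative. If the tag does not fall inside the 10,000-character window, you would widen the window or step backwards and retry.)

    use strict;
    use warnings;

    open my $fh, '<', 'giant.htm' or die "giant.htm: $!";
    my $size = -s $fh;    # total file size in bytes

    # Read a 10,000-byte window ending at the midpoint.
    my $window = 10_000;
    my $start  = int($size / 2) - $window;
    seek $fh, $start, 0 or die $!;
    read $fh, my $buf, $window;

    # Find the end-tag inside the window; pos() gives the offset just
    # past the match, so $start + pos() is the split point in the file.
    if ($buf =~ /<\/html>/g) {
        my $split_at = $start + pos($buf);
        print "split the file after byte $split_at\n";
    }
    close $fh;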

    You do not need to use a big string ... to split a big file.