
You are correct, my apologies. I should have clarified: this massive HTML string is actually a concatenation of hundreds of smaller HTML files. I realize this sounds stupid, but I am combining hundreds of files into a few (about 10) and then checking if any of these is too big. This is necessary for reasons I won't go into in detail (table of contents generation, internal PDF linkage...). Ideally I would split the string at the first closing </html> tag past 500,000 characters. This is why I was looking to use a regex: /(.{500000,}?<\/html>)/gs
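
For illustration, the split described above can also be done without a counted quantifier. As far as I know, perl limits the counts allowed in {m,n} (32766 on older versions), so a count of 500000 would probably be rejected at compile time. The sketch below sets pos() to the minimum offset and searches for the next closing tag from there. The name split_after_html and the case-insensitive, whitespace-tolerant tag pattern are assumptions of mine, not something from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: split $html just after the first closing </html>
# tag that starts at or beyond offset $min_len.
# Returns ($head, $tail); $tail is empty when no such tag is found.
sub split_after_html {
    my ($html, $min_len) = @_;
    return ($html, '') if length($html) <= $min_len;

    pos($html) = $min_len;              # a scalar //g match starts at pos()
    if ($html =~ m{</html\s*>}gi) {     # assumption: tag may be upper case
        my $cut = pos($html);           # offset just past the matched tag
        return (substr($html, 0, $cut), substr($html, $cut));
    }
    return ($html, '');
}

# Tiny demonstration with a threshold of 20 instead of 500_000.
my $combined = '<html>aaaa</html><html>bbbb</html><html>cc</html>';
my ($head, $tail) = split_after_html($combined, 20);
print length($head), " ", length($tail), "\n";   # prints "34 15"
```

Calling the function again on the returned tail would produce further chunks until the tail comes back empty.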

Re^3: Split very big string in half
by karlgoethebier (Abbot) on Apr 15, 2015 at 21:03 UTC
    "...I am combining hundreds of files into few (about 10) and then checking if any of these is too big..."

    Perhaps you should rethink your concept?

    I'm unsure if my idea fits your needs, but couldn't you convert each file to PDF first and then concatenate them, perhaps step by step?

    As far as I remember, CAM::PDF provides this feature.

    Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

Re^3: Split very big string in half
by afoken (Chancellor) on Apr 16, 2015 at 11:45 UTC

    karlgoethebier is right. Your concept is broken.

    You can't concatenate several HTML documents to make one big document. Of course your computer lets you do exactly that, but while the result may be rendered by some generous browser, it is really junk.

    Split the big file into the original documents, and pass each document separately to the PDF converter. Splitting should not be that hard, assuming the original documents are reasonably clean:

    1. Open the big file for reading.
    2. Open an output file.
    3. Read a line from the big file.
    4. If the line contains something that looks like the start of an HTML document ("<!DOCTYPE", "<HTML", "<?xml"), write everything up to the match to the current output file, then close it, create a new file, and write the match and everything following it to the new file.
    5. Otherwise, write the line to the current output file.
    6. Repeat from step 3 until EOF.
    7. Close the input and output files.
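
    The steps above could be sketched roughly as follows. This is only a minimal illustration: the function name split_documents, the "$prefix.NNN.html" naming scheme, and the exact signature pattern are my assumptions, and only the first signature on each line is considered.

```perl
use strict;
use warnings;

# Signatures that look like the start of an HTML document (step 4).
my $START = qr/<!DOCTYPE|<html\b|<\?xml/i;

# Hypothetical helper: read $in_fh line by line and write each detected
# document to "$prefix.NNN.html". Returns the list of files written.
sub split_documents {
    my ($in_fh, $prefix) = @_;
    my (@files, $out);
    my $n = 0;

    # Close the current output file (if any) and open the next one.
    my $next_file = sub {
        if ($out) { close $out or die "close failed: $!" }
        my $name = sprintf '%s.%03d.html', $prefix, $n++;
        open $out, '>', $name or die "Cannot write $name: $!";
        push @files, $name;
    };

    $next_file->();                              # step 2
    while (my $line = <$in_fh>) {                # steps 3 and 6
        if ($line =~ $START) {                   # step 4
            my $at = $-[0];                      # offset where the match begins
            print {$out} substr($line, 0, $at);  # text before the match
            $next_file->();
            print {$out} substr($line, $at);     # the match and the rest
        }
        else {
            print {$out} $line;                  # step 5
        }
    }
    close $out or die "close failed: $!";        # step 7
    return @files;
}
```

    A call might look like split_documents($fh, 'part') with $fh opened on the big file (step 1). Note that this plain version starts a new file at every signature it sees, so the first file can end up empty and one document can be cut into several pieces.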

    You may need to add some special cases:

    • The first output file will usually be empty if there is no junk at the start of the big file.
    • All of the signatures for the document start may be found in a single document, following each other (a valid XHTML document starts with <?xml version="1.0" ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">, whereas HTML5 starts with <!DOCTYPE html><html>). You usually want only the first match.

    A simple trick is to assume that the <?xml and <!DOCTYPE declarations are relatively short, but a complete HTML document needs much more data, at least 500 characters (or something like that). So if tell OUTPUT returns a non-negative number less than 500 when matching a signature, don't create a new output file, but continue to write to the old output file. This also avoids an empty first file.
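
    A rough sketch of the tell trick, under the same caveats as any quick example: the name split_documents_guarded, the file naming scheme, and the optional third argument for the threshold are assumptions for illustration. tell counts bytes, not characters, which should not matter for a rough limit like 500.

```perl
use strict;
use warnings;

# Signatures that look like the start of an HTML document.
my $START = qr/<!DOCTYPE|<html\b|<\?xml/i;

# Hypothetical helper: like a plain line-by-line splitter, but a signature
# only starts a new file when the current output file already holds at
# least $min_bytes (default 500, the rough guess from the text).
sub split_documents_guarded {
    my ($in_fh, $prefix, $min_bytes) = @_;
    $min_bytes = 500 unless defined $min_bytes;
    my (@files, $out);
    my $n = 0;

    my $next_file = sub {
        if ($out) { close $out or die "close failed: $!" }
        my $name = sprintf '%s.%03d.html', $prefix, $n++;
        open $out, '>', $name or die "Cannot write $name: $!";
        push @files, $name;
    };

    $next_file->();
    while (my $line = <$in_fh>) {
        if ($line =~ $START) {
            my $at      = $-[0];        # offset where the match begins
            my $written = tell($out);   # bytes written to the current file
            # A non-negative position below the threshold means we are still
            # inside the declarations of the same document: do not split.
            if ($written >= 0 && $written < $min_bytes) {
                print {$out} $line;
            }
            else {
                print {$out} substr($line, 0, $at);
                $next_file->();
                print {$out} substr($line, $at);
            }
        }
        else {
            print {$out} $line;
        }
    }
    close $out or die "close failed: $!";
    return @files;
}
```

    With this guard the first signature is seen at position 0, so nothing is split off and no empty first file appears. Documents shorter than the threshold would be glued to their successor, so the limit has to fit the data.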

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)