in reply to Cleaning HTML Fragments with open tags

Do you rerun the code on the 22e6 fragments on every run? Why not just run the code on those that have changed and store the already cleaned ones in a cache?

Re^2: Cleaning HTML Fragments with open tags
by learnedbyerror (Monk) on Oct 23, 2018 at 20:26 UTC

    haukex, for this operation, yes, I run it on each fragment.

    There are a number of routines that run against each fragment, and in most of them I need the data in its original form. One option I have considered is to do as you propose: pre-process all of the raw fragments and store them in a separate database (I am using LMDB). I wrote several tests to compare the processing time and found that the routines are usually I/O bound, so I have minimized my I/O reads and try to use the raw data multiple times while it is in memory. That is how I ended up with the approach I described.
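
    Just to illustrate the shape of it, here is a minimal sketch of that read-once, use-many pattern; the routines and the hard-coded fragments are placeholders only, not my real code:

        use strict;
        use warnings;

        # Placeholder routines; the real ones need the fragment in its raw form.
        sub count_words { my @w = split ' ', $_[0]; return scalar @w }
        sub has_links   { return $_[0] =~ /<a\b/i ? 1 : 0 }

        my @routines  = ( \&count_words, \&has_links );
        my @fragments = ( '<p>one two</p>', '<a href="x">link' );  # really read from LMDB

        # Read each fragment once, then run every routine against the in-memory
        # copy, so the I/O cost is paid a single time per fragment.
        for my $raw (@fragments) {
            my @results = map { $_->($raw) } @routines;
        }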

    Your note did make me think of an alternative, though, that would still keep the reads close to the minimum, only slightly more than the current approach. A little additional information first: my ingestion process reads the raw HTML files, parses them, and then writes to several databases. The first contains a compressed, serialized (Sereal) object of the whole HTML page as well as its constituents, already parsed out. The second contains a compressed, serialized version of each fragment. I calculate an MD5 sum on the compressed, serialized object and use this as the key. The third is an index database configured to allow duplicate keys, where the keys are the users and the values are the MD5 sums of their fragments. The index data is very small, and the processing time to access it is negligible.
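
    To make that concrete, here is a rough sketch of the ingestion side. The parse_page() stub and the plain hashes standing in for the LMDB databases are placeholders for illustration only:

        use strict;
        use warnings;
        use Sereal::Encoder;
        use Digest::MD5 qw(md5_hex);

        # Plain hashes stand in for the three LMDB databases here.
        my %page_db;        # MD5 of whole-page object -> serialized page + constituents
        my %fragment_db;    # MD5 of fragment object   -> serialized fragment
        my %user_index;     # user -> list of fragment MD5s (duplicate keys in LMDB)

        my $enc = Sereal::Encoder->new;   # the real encoder also enables compression

        # Placeholder parser: really this parses the page and its fragments out.
        sub parse_page {
            my ($html) = @_;
            my @fragments = map { { html => $_ } } split /\n{2,}/, $html;
            return { html => $html, fragments => \@fragments }, @fragments;
        }

        sub ingest_page {
            my ($user, $raw_html) = @_;
            my ($page, @fragments) = parse_page($raw_html);

            # First DB: serialized object of the whole page, keyed by its MD5 sum.
            my $page_blob = $enc->encode($page);
            $page_db{ md5_hex($page_blob) } = $page_blob;

            # Second DB: one record per fragment; third DB: user -> fragment MD5.
            for my $frag (@fragments) {
                my $blob = $enc->encode($frag);
                my $md5  = md5_hex($blob);
                $fragment_db{$md5} = $blob;
                push @{ $user_index{$user} }, $md5;
            }
        }

        ingest_page( 'some_user', "<p>first fragment</p>\n\n<p>second fragment</p>" );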

    The idea you triggered is to check each fragment: if the fragment has errors, use HTML::Tidy to clean it, then save a new record containing the clean version to the fragment database. A fourth database, similar to the third, allows duplicate keys, with the user as the key and the MD5 sum of the clean version of the fragment as the value. In the vast majority of cases that will be the raw fragment; for the remainder, it will be the corrected one.
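
    Something along these lines is what I have in mind. Again, the hashes standing in for the databases and the index_clean_fragment() name are placeholders, and the real HTML::Tidy object would carry fragment-friendly options:

        use strict;
        use warnings;
        use HTML::Tidy;
        use Sereal::Encoder;
        use Digest::MD5 qw(md5_hex);

        my %fragment_db;        # stands in for the LMDB fragment database
        my %user_clean_index;   # fourth DB: user -> MD5 of the clean fragment version

        my $enc  = Sereal::Encoder->new;   # compression as in the real setup
        my $tidy = HTML::Tidy->new;        # real config would tune tidy for fragments

        # Check one fragment; tidy and store it only if it has problems, and
        # record which MD5 (raw or clean) the user's clean index should point at.
        sub index_clean_fragment {
            my ($user, $raw_md5, $raw_html) = @_;

            $tidy->clear_messages;
            $tidy->parse( 'fragment', $raw_html );

            my $clean_md5 = $raw_md5;               # vast majority: raw is already fine
            if ( $tidy->messages ) {                # only broken fragments get tidied
                my $clean_html = $tidy->clean($raw_html);
                my $blob       = $enc->encode($clean_html);
                $clean_md5     = md5_hex($blob);
                $fragment_db{$clean_md5} = $blob;   # new record for the clean version
            }

            push @{ $user_clean_index{$user} }, $clean_md5;
            return $clean_md5;
        }

        index_clean_fragment( 'some_user', 'raw-md5-here', '<p>unclosed paragraph' );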

    This approach has a small impact on the database size and will consume a relatively small amount of additional RAM, about 200 MB. It cuts out the repeated cleaning overhead, and the changes to the code should be pretty small!

    I'll give it a try and will let you know what I find.

    Overall, I am still interested in learning whether there are more performant ways of using my first approach. While I expected to incur some overhead, what I actually observed was much larger than expected.

    Thanks! lbe