I have an uncompressed PDF file. I'm trying to find every position where a line begins with the pattern <digits> 0 obj, e.g. 3 0 obj.
Once I have found an occurrence, I want to know how many bytes lie between the beginning of the file and that position.
Besides regular ASCII characters, the file also contains binary data. A snippet is shown below:
/CapHeight 667 >> endobj 1 0 obj << /Subtype /Type1C /Length 1194 2 0 obj >> stream 3 0 obj BXTBMO+URWTypewriterTOT-LigNar %øøøûûúNú% Lm ÷/÷>¦ 7D(URW)++,Copyright 2003 by (URW)++ Design & Developmen +t/FSType 4 def - B F H J O S r ¡S»OÑûb싸øÕ¸÷Å÷°À +ø÷KŸ—{zwû }†…}ûŠ}†‘™ø™‘™Ô›”“𙂓{ûp{ƒ|}•ƒ›¾™…}ü}†…}X{ƒ|}•ƒ› +øMœ’’Å}µ÷g°÷4µµÄ÷¢Áø£z’„œÜš”“˜˜‚’|k}†‘š÷΃ªr¨°m\\œH +^f„|oaviZb{—~™™–•›È”¸¬×¶©‚v¡¤t‘uJˆ„„‰ƒ‹Œ„YŒ<Z„{hYso^P4Ç +UîἬڬ.÷!±´‰‰ˆŽ†„e~L~u^r^sPDa±ÊÕį÷º}µ÷V´÷Aµ±ÆøN÷rš“’› +¥ƒº¨àoH¹+û8.û.û-Þ/÷ÖÀ¢º®Ÿ¥šœ›‚–}…„w†H{Ue?T^Ÿ±lr©zº±–’‘ +˜´‚†•«Ÿ¼¥©©¤´œº·±|p¤¨lž^d……òûIµ÷#µø ´oµ±Ã÷ÐÁ¬ø.…uƒovat_wJ>Y¨¾ˆ„’{{~~zužj¤vo¬º~ÑÛ»¸¯§—¶Ïø +™‘™²š”’˜™‚’|3{„„zI\\ÎsO°8û=1û\'û#Ý-÷¹ž¯®œ‘–’£û-÷Ø·®|l +©g–fAG‚fsken^uZ+JÛ÷÷ÊÖï‹´ø´öÜÞÜKÁè÷.øLœ„’z4|‚ƒ~”ƒš± +™…}ûê}†…}e|‚ƒ~”ƒš÷6𔓗˜‚“|k}†‘™ðløãuyyuužx¡¡¡¢ytñ‹ +´ø´oµðÁ÷–ÁØ÷/øLœ„’z3|‚ƒ~”ƒš²™…}ûê}†…}d|‚ƒ~”ƒš÷=𔓗˜ +‚“|e}†‘™÷i¸ïÀÉàÙµ_9û}†…}e|‚ƒ~”ƒš÷=𔓗˜‚“|d}†‘™÷œîSÁ +%>TfIu ‹´ø´oµðÁÐ÷/øLœ„’z3|‚ƒ~”ƒš²™…}ûê}†…}d|‚ƒ~”ƒš÷ +M𔓗˜‚“|U}†‘™÷\\ۥº¢0”ž©“œŽ‹‰‘Š’Ž‹”‘’—›}•rec{rqyy‚{‚i +wŸødŸ÷KŸ¶ öŽÃ’ ÷Œø\\ endstream endobj 4 0 obj << /Length 422 >>
To find the locations in the above code snippet, I'm using a regex like so: qr/^\d+ 0 obj/m.
This is the test code I'm using ($pdf contains the string, and \d+ is replaced with a fixed number as a test):
    my $result = $pdf=~qr/^1 0 obj/m;
    say "Finding first item at start position [$-[0]]" if $result;
    say "Finding first item at start position [$+[0]]\n" if $result;
    $result = $pdf=~qr/^2 0 obj/m;
    say "Finding second item at start position [$-[0]]" if $result;
    say "Finding second item at start position [$+[0]]\n" if $result;
    $result = $pdf=~qr/^3 0 obj/m;
    say "Finding third item at start position [$-[0]]" if $result;
    say "Finding third item at start position [$+[0]]\n" if $result;
    $result = $pdf=~qr/^4 0 obj/m;
    say "Finding fourth item at start position [$-[0]]" if $result;
    say "Finding fourth item at start position [$+[0]]\n" if $result;

This results in the following output:
    Finding first item at start position [26]
    Finding first item at start position [33]

    Finding second item at start position [68]
    Finding second item at start position [75]

    Finding third item at start position [87]
    Finding third item at start position [94]

    Finding fourth item at start position [2035]
    Finding fourth item at start position [2042]
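For comparison, here is a minimal sketch of the same search written as a single loop over all matches, assuming the goal is true byte offsets. The helper names find_obj_offsets and slurp_raw are hypothetical and not part of the original test code. The offsets in @- and @+ count characters of the string, so they only equal byte counts when the file was read without any decoding layer; the sketch therefore opens the file with the :raw layer. It also assumes lines end in "\n", since ^ under /m only anchors after a newline.

```perl
use strict;
use warnings;

# Hypothetical helper: return one record per "<digits> 0 obj" that starts
# a line. Offsets are positions in the string, which equal byte offsets
# only if the string holds raw, undecoded bytes.
sub find_obj_offsets {
    my ($bytes) = @_;
    my @offsets;
    while ($bytes =~ /^(\d+) 0 obj/mg) {
        # $-[0] is where the match starts, $+[0] is just past its end.
        push @offsets, { obj => $1, start => $-[0], end => $+[0] };
    }
    return @offsets;
}

# Hypothetical helper: slurp a whole file as raw bytes (no decoding,
# no newline translation).
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;    # undef input record separator: read everything at once
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $pdf = slurp_raw($ARGV[0]);
    printf "%d 0 obj starts at byte %d\n", $_->{obj}, $_->{start}
        for find_obj_offsets($pdf);
}
```

If the offsets from a loop like this still differ from what a hex editor shows, it is worth checking how $pdf was filled: a string read through an :encoding(...) layer can hold fewer characters than the file holds bytes.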
In reply to Calculated position incorrect when using regex in text file that also contains binary info by geertvc