Nonsense. It's easy-peasy. Slurp the whole file into memory, convert it to a character string, then offer filehandle-like accessors to that string.
Obviously, you want to avoid slurping the whole file into memory, but that's "just" an optimization. Worry about that when you've got the easy-peasy implementation working right.
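The slurp-everything approach can be sketched in a few lines. This is just an illustration of the idea (in Python rather than Perl, and the function name is mine, not a real module's API): read all the bytes, decode them, and hand lines back in reverse.

```python
def read_backwards(path, encoding="utf-32-le"):
    """Yield the lines of a file in reverse order by slurping the whole
    thing first -- the naive approach: no seeking, no byte arithmetic."""
    with open(path, "rb") as fh:
        text = fh.read().decode(encoding)
    # splitlines() handles \n and \r\n; reversing the list gives us
    # last-line-first iteration.
    for line in reversed(text.splitlines()):
        yield line
```

Memory use is proportional to file size, which is exactly the "just an optimization" problem deferred above.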
As it happens, with UTF-32 you do know whether you're in the middle of a character (as in: codepoint, rather than grapheme), because each character is exactly 32 bits. Just take the byte offset modulo 4. So UTF-32 is an easy case to optimize and avoid slurping the entire file.
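That modulo-4 trick looks like this (illustrative helper, not any library's API): given an arbitrary byte offset into a UTF-32 stream, round it down to the start of the code unit it falls inside.

```python
def utf32_char_start(byte_offset):
    """Round a byte offset down to the start of the UTF-32 code unit
    containing it. Every codepoint is exactly 4 bytes, so the offset
    modulo 4 is how far into a character we are."""
    return byte_offset - (byte_offset % 4)
```

So from any seek position you can snap backwards to a character boundary and start decoding there, no slurping required.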
UTF-8 is harder, but not much. If the high bit of a byte is set, you're somewhere in a multibyte sequence. If the second-highest bit is also set, you're at the start of that sequence, and counting the leading one bits before the first zero bit tells you how many bytes the sequence contains; if the second-highest bit is clear, you're on a continuation byte and need to step backwards to find the start.
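The bit tests described above can be sketched like so (again an illustrative Python sketch with made-up function names, not a real module's interface):

```python
def utf8_byte_kind(b):
    """Classify one UTF-8 byte by its top bits:
    0xxxxxxx -> a complete one-byte (ASCII) character
    10xxxxxx -> continuation byte, middle of a multibyte sequence
    11xxxxxx -> lead byte, start of a multibyte sequence"""
    if b & 0x80 == 0:
        return "ascii"
    if b & 0x40 == 0:
        return "continuation"
    return "lead"

def utf8_sequence_length(lead):
    """Count the leading one bits before the first zero bit of a lead
    byte; that count is the total number of bytes in the sequence."""
    n = 0
    while lead & 0x80:
        n += 1
        lead = (lead << 1) & 0xFF
    return n
```

Scanning backwards, you skip continuation bytes until you hit an ASCII or lead byte, and that's a character boundary.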
So you optimize specific, common cases, and fall back to the slurping technique.
In reply to Re^2: How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode
by tobyink
in thread How do I use the "File::ReadBackwards" and open in "Unicode text, UTF-32, little-endian" mode
by hashperl