Character encoding
There are three hard problems in computer science: 1) Converting from PDF, 2) Converting to PDF, and 3) O̳̳̳̳̳̳̳̳̳̳̳̳̳̳̳̳̳Ҙ҉҉҉ʹʹ҉ʹ̨̨̨̨̨̨̨̨̃༃༃O̳̳̳̳̳̳̳̳̳̳̳̳̳̳̳̳̳Ҙ҉҉҉ʹʹ҉ʹ̨̨̨̨̨̨̨̨̃༃༃ʹʹ҉ʹ̨̨̨̨̨̨̨̨̃༃༃
In most circumstances, pikepdf performs appropriate encodings and decodings on its own, or returns pikepdf.String if it is not clear whether to present data as a string or binary data.
str(pikepdf.String) is performed by inspecting the binary data. If the binary data begins with a UTF-16 byte order mark, then the data is interpreted as UTF-16 and returned as a Python str. Otherwise, the binary data is interpreted as PDFDocEncoding and decoded to str. Again, in most cases this is correct behavior and will operate transparently.
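The decision rule described above can be sketched in plain Python. This is an illustration only, not pikepdf's implementation: decode_pdf_string is a hypothetical helper, the sketch assumes the byte order mark is the big-endian one, and it uses "latin-1" as a stand-in fallback so that it runs without pikepdf. With pikepdf imported, the fallback would be the "pdfdoc" codec instead.

```python
import codecs


def decode_pdf_string(raw: bytes, fallback: str = "latin-1") -> str:
    """Hypothetical helper: decode PDF string bytes using the BOM rule.

    Assumption: only the big-endian UTF-16 byte order mark is checked.
    Assumption: "latin-1" stands in for the PDFDocEncoding fallback.
    """
    bom = codecs.BOM_UTF16_BE  # b"\xfe\xff"
    if raw.startswith(bom):
        # Strip the BOM, then decode the remainder as big-endian UTF-16.
        return raw[len(bom):].decode("utf-16-be")
    # No BOM: treat the bytes as a single-byte encoding.
    return raw.decode(fallback)


print(decode_pdf_string(b"\xfe\xff\x00H\x00i"))  # UTF-16 path
print(decode_pdf_string(b"Hi"))                  # fallback path
```

Both calls yield the same text, which is why the conversion usually goes unnoticed in practice.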
Some functions are available in circumstances where it is necessary to force a particular conversion.
PDFDocEncoding
The PDF specification defines PDFDocEncoding, a character encoding used only in PDFs. It is quite similar to ASCII but not equivalent.
When pikepdf is imported, it automatically registers "pdfdoc" as a codec with the standard library, so that it may be used in string and byte conversions.
"•".encode('pdfdoc') == b'\x80'
Other codecs
Two other codecs are commonly used in PDFs, but they are already part of the standard library.
WinAnsiEncoding is identical to Windows Code Page 1252, and may be converted using the "cp1252" codec.
MacRomanEncoding may be converted using the "macroman" codec.
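As a brief illustration, the same character can occupy a different byte value in each encoding, so the codec must match the encoding the PDF actually uses. The sketch below relies only on the standard library codecs named above. The sample bytes are chosen for illustration and do not come from any particular PDF.

```python
# The bullet character sits at a different byte value in each code page.
win_ansi_bytes = b"\x95 item"   # bullet in Windows Code Page 1252
mac_roman_bytes = b"\xa5 item"  # bullet in Mac OS Roman

win_text = win_ansi_bytes.decode("cp1252")
mac_text = mac_roman_bytes.decode("macroman")

print(win_text)
print(mac_text)

# Decoding with the wrong codec succeeds silently but yields the wrong
# character, so a mismatch may only show up as garbled output.
wrong_text = mac_roman_bytes.decode("cp1252")
print(wrong_text)
```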