PDF Metadata¶
PDF has two different types of metadata: XMP metadata, and DocumentInfo, which is deprecated but still relevant. For backward compatibility, both should contain the same content. pikepdf provides a convenient interface that coordinates edits to both, but is limited to the most common metadata features.
XMP (Extensible Metadata Platform) metadata is a metadata specification in XML format that is used in many formats other than PDF. For full information on XMP, see Adobe’s XMP Developer Center. The XMP Specification also provides useful information.
pikepdf can read compound metadata quantities, but it can modify only scalar
quantities. For more complex changes, consider using the python-xmp-toolkit
library and its libexempi dependency; note, however, that it is not capable of
synchronizing changes to the older DocumentInfo metadata.
Accessing metadata¶
The XMP metadata stream is attached to the PDF’s root object, but to simplify
management of this, use pikepdf.Pdf.open_metadata(). The returned
pikepdf.models.PdfMetadata object may be used for reading, or entered with a
with block to modify and commit changes. If you use this interface, pikepdf
will synchronize changes to new and old metadata.
A PDF must still be saved after metadata is changed.
In [1]: pdf = pikepdf.open('../tests/resources/sandwich.pdf')
In [2]: meta = pdf.open_metadata()
In [3]: meta['xmp:CreatorTool']
Out[3]: 'ocrmypdf 5.3.3 / Tesseract OCR-PDF 3.05.01'
If no XMP metadata exists, an empty XMP metadata container will be created.
Open metadata in a with block to open it for editing. When the block is
exited, changes are committed (updating XMP and the Document Info dictionary)
and attached to the PDF object. The PDF must still be saved. If an exception
occurs in the block, changes are discarded.
In [4]: with pdf.open_metadata() as meta:
   ...:     meta['dc:title'] = "Let's change the title"
   ...:
The list of available metadata fields may be found in the XMP Specification.
Checking PDF/A conformance¶
The metadata interface can also test if a file claims to be conformant to the PDF/A specification.
In [5]: pdf = pikepdf.open('../tests/resources/veraPDF test suite 6-2-10-t02-pass-a.pdf')
In [6]: meta = pdf.open_metadata()
In [7]: meta.pdfa_status
Out[7]: '1B'
Note
This property merely tests whether the file claims to be conformant to the PDF/A standard; it does not validate the file. Use a tool such as veraPDF to verify conformance.
Low-level XMP metadata access¶
You can read the raw XMP metadata if desired. For example, one could extract it
and edit it using the full featured python-xmp-toolkit library.
In [8]: xmp = pdf.root.Metadata.read_bytes()
In [9]: type(xmp)
Out[9]: bytes
In [10]: print(xmp.decode())
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
<dc:creator>
<rdf:Seq>
<rdf:li>veraPDF Consortium</rdf:li>
</rdf:Seq>
</dc:creator>
</rdf:Description>
<rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about="" xmp:CreatorTool="veraPDF Test Builder" xmp:CreateDate="2015-03-10T17:19:21+01:00" xmp:ModifyDate="2015-03-10T17:19:21+01:00"/>
<rdf:Description xmlns:pdf="http://ns.adobe.com/pdf/1.3/" rdf:about="" pdf:Producer="veraPDF Test Builder 1.0 "/>
<rdf:Description xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/" rdf:about="" pdfaid:part="1" pdfaid:conformance="B"/>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>
Editing XMP with a generic XML library is probably not worth the trouble; the semantics are fairly complex.
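Read-only inspection is more tractable than editing. The sketch below parses the packet shown above with Python's standard xml.etree.ElementTree and pulls out a few values. The NS mapping and the find_attribute helper are illustrative names introduced here, and the approach assumes the properties appear in the same serialization as in this sample (an rdf:Seq for dc:creator, attributes on rdf:Description for the rest); other XMP packets may serialize the same properties differently.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the sample packet.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
    "pdfaid": "http://www.aiim.org/pdfa/ns/id/",
}

# The packet from the example above; normally this would come from
# pdf.root.Metadata.read_bytes().
xmp = b"""<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
<dc:creator>
<rdf:Seq>
<rdf:li>veraPDF Consortium</rdf:li>
</rdf:Seq>
</dc:creator>
</rdf:Description>
<rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about="" xmp:CreatorTool="veraPDF Test Builder" xmp:CreateDate="2015-03-10T17:19:21+01:00" xmp:ModifyDate="2015-03-10T17:19:21+01:00"/>
<rdf:Description xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/" rdf:about="" pdfaid:part="1" pdfaid:conformance="B"/>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>"""


def find_attribute(root, prefix, name):
    """Return an attribute-form property from any rdf:Description, or None."""
    key = f"{{{NS[prefix]}}}{name}"
    for description in root.iter(f"{{{NS['rdf']}}}Description"):
        if key in description.attrib:
            return description.attrib[key]
    return None


root = ET.fromstring(xmp)

# dc:creator is an ordered list (rdf:Seq) of names.
creators = [li.text for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)]
creator_tool = find_attribute(root, "xmp", "CreatorTool")
pdfa_part = find_attribute(root, "pdfaid", "part")
pdfa_conformance = find_attribute(root, "pdfaid", "conformance")

print(creators, creator_tool, pdfa_part, pdfa_conformance)
```

Even for reading, the need to handle both element and attribute serializations hints at why a dedicated XMP library is the better choice for anything beyond a quick look.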
Warning
Manual changes to the XMP stream object will not be synchronized with a live PdfMetadata object or the Document Info block.
The Document Info dictionary¶
The Document Info block is an older, now deprecated object in which metadata
may be stored. The Document Info is not attached to the /Root object.
It may be accessed using the .docinfo property. If no Document Info exists,
accessing .docinfo will properly initialize an empty one.
Here is an example of a Document Info block.
In [11]: pdf = pikepdf.open('../tests/resources/sandwich.pdf')
In [12]: pdf.docinfo
Out[12]:
pikepdf.Dictionary({
"/CreationDate": "D:20170911132748-07'00'",
"/Creator": "ocrmypdf 5.3.3 / Tesseract OCR-PDF 3.05.01",
"/ModDate": "D:20170911132748-07'00'",
"/Producer": "GPL Ghostscript 9.21"
})
It is permitted in pikepdf to directly interact with Document Info as with
other PDF dictionaries. However, it is better to use .open_metadata()
because that interface will apply changes to both XMP and Document Info in a
consistent manner.
You may copy data from a Document Info object in the current PDF or another
PDF into XMP metadata using load_from_docinfo().