public class COSParser extends BaseParser

PDFParser.parse() or FDFParser.parse() must be called before page objects can be retrieved, e.g. PDFParser.getPDDocument().

This class is a much enhanced version of QuickParser presented in PDFBOX-1104 by Jeremy Villalobos.

Modifier and Type | Field | Description |
---|---|---|
protected static char[] | EOF_MARKER | EOF-marker. |
protected long | fileLen | file length. |
protected boolean | initialParseDone | |
protected static char[] | OBJ_MARKER | obj-marker. |
protected SecurityHandler | securityHandler | The security handler. |
protected RandomAccessRead | source | |
static java.lang.String | SYSPROP_EOFLOOKUPRANGE | The range within which the %%EOF marker will be searched. |
static java.lang.String | SYSPROP_PARSEMINIMAL | Only parse the PDF file minimally allowing access to basic information. |
static java.lang.String | TMP_FILE_PREFIX | The prefix for the temp file being used. |
protected XrefTrailerResolver | xrefTrailerResolver | Collects all Xref/trailer objects and resolves them into a single object using the startxref reference. |
Fields inherited from class BaseParser: A, ASCII_CR, ASCII_LF, B, D, DEF, document, E, ENDOBJ_STRING, ENDSTREAM_STRING, J, M, N, O, R, S, seqSource, STREAM_STRING, T
Constructor | Description |
---|---|
COSParser(RandomAccessRead source) | Default constructor. |
COSParser(RandomAccessRead source, java.lang.String password, java.io.InputStream keyStore, java.lang.String keyAlias) | Constructor for encrypted pdfs. |
Modifier and Type | Method | Description |
---|---|---|
protected void | checkPages(COSDictionary root) | Check if all entries of the pages dictionary are present. |
AccessPermission | getAccessPermission() | This will get the AccessPermission. |
COSDocument | getDocument() | This will get the document that was parsed. |
PDEncryption | getEncryption() | This will get the encryption dictionary. |
protected long | getStartxrefOffset() | Looks for and parses startxref. |
protected boolean | isCatalog(COSDictionary dictionary) | Tell if the dictionary is a PDF catalog. |
boolean | isLenient() | Return true if parser is lenient. |
protected int | lastIndexOf(char[] pattern, byte[] buf, int endOff) | Searches last appearance of pattern within buffer. |
protected COSStream | parseCOSStream(COSDictionary dic) | This will read a COSStream from the input stream using length attribute within dictionary. |
protected void | parseDictObjects(COSDictionary dict, COSName... excludeObjects) | Will parse every object necessary to load a single page from the pdf document. |
protected boolean | parseFDFHeader() | Parse the header of a fdf. |
protected COSBase | parseObjectDynamically(long objNr, int objGenNr, boolean requireExistingNotCompressedObj) | This will parse the next object from the stream and add it to the local state. |
protected COSBase | parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) | This will parse the next object from the stream and add it to the local state. |
protected boolean | parsePDFHeader() | Parse the header of a pdf. |
protected COSBase | parseTrailerValuesDynamically(COSDictionary trailer) | Parse the values of the trailer dictionary and return the root object. |
protected COSDictionary | parseXref(long startXRefOffset) | Parses cross reference tables. |
protected boolean | parseXrefTable(long startByteOffset) | This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored. |
protected COSDictionary | rebuildTrailer() | Rebuild the trailer dictionary if startxref can't be found. |
protected COSDictionary | retrieveTrailer() | Read the trailer information and provide a COSDictionary containing the trailer information. |
void | setEOFLookupRange(int byteCount) | Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker. |
void | setLenient(boolean lenient) | Change the parser leniency flag. |
Methods inherited from class BaseParser: isClosing, isClosing, isDigit, isDigit, isEndOfName, isEOL, isEOL, isSpace, isSpace, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSString, parseDirObject, readExpectedChar, readExpectedString, readExpectedString, readGenerationNumber, readInt, readLine, readLong, readObjectNumber, readString, readString, readStringNumber, skipSpaces, skipWhiteSpaces
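To make the summaries of lastIndexOf(char[], byte[], int) and getStartxrefOffset() more concrete, here is a minimal plain-Java sketch of a backward pattern search and of locating the startxref value near the end of a file. It is not the PDFBox implementation: the class name StartxrefLookupSketch and the helper findStartxrefOffset are hypothetical, the input is an in-memory byte array rather than a RandomAccessRead, and a missing marker is reported as -1 instead of being handled leniently or by an exception.

```java
import java.util.Arrays;

// Illustrative sketch only; names other than lastIndexOf are hypothetical.
final class StartxrefLookupSketch {

    /**
     * Returns the start offset of the last occurrence of pattern that ends
     * before endOff (exclusive), or -1 if there is none.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        if (pattern.length == 0) {
            return -1; // sketch choice: an empty pattern counts as "not found"
        }
        // Try candidate start positions from right to left.
        for (int start = Math.min(endOff, buf.length) - pattern.length; start >= 0; start--) {
            int matched = 0;
            while (matched < pattern.length && buf[start + matched] == (byte) pattern[matched]) {
                matched++;
            }
            if (matched == pattern.length) {
                return start;
            }
        }
        return -1;
    }

    /**
     * Hypothetical helper: finds the last "%%EOF" within the trailing
     * lookupRange bytes, steps back to the preceding "startxref" keyword and
     * parses the decimal number after it. Returns -1 if anything is missing.
     */
    static long findStartxrefOffset(byte[] file, int lookupRange) {
        int windowStart = Math.max(0, file.length - lookupRange);
        byte[] tail = Arrays.copyOfRange(file, windowStart, file.length);

        int eof = lastIndexOf("%%EOF".toCharArray(), tail, tail.length);
        if (eof < 0) {
            return -1;
        }
        int keyword = lastIndexOf("startxref".toCharArray(), tail, eof);
        if (keyword < 0) {
            return -1;
        }

        // Skip whitespace after the keyword, then read the digits.
        int pos = keyword + "startxref".length();
        while (pos < eof && Character.isWhitespace((char) tail[pos])) {
            pos++;
        }
        long offset = 0;
        boolean sawDigit = false;
        while (pos < eof && tail[pos] >= '0' && tail[pos] <= '9') {
            offset = offset * 10 + (tail[pos] - '0');
            sawDigit = true;
            pos++;
        }
        return sawDigit ? offset : -1;
    }
}
```

In this sketch, a lookup range that is too small misses the marker entirely when garbage follows %%EOF, which is the situation the setEOFLookupRange(int) notes warn about.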
protected final RandomAccessRead source
public static final java.lang.String SYSPROP_PARSEMINIMAL
public static final java.lang.String SYSPROP_EOFLOOKUPRANGE
protected static final char[] EOF_MARKER
protected static final char[] OBJ_MARKER
protected long fileLen
protected boolean initialParseDone
protected SecurityHandler securityHandler
protected XrefTrailerResolver xrefTrailerResolver
public static final java.lang.String TMP_FILE_PREFIX
public COSParser(RandomAccessRead source)
public COSParser(RandomAccessRead source, java.lang.String password, java.io.InputStream keyStore, java.lang.String keyAlias)
source - input representing the pdf.
password - password to be used for decryption.
keyStore - key store to be used for decryption when using public key security
keyAlias - alias to be used for decryption when using public key security

public void setEOFLookupRange(int byteCount)
Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.
The new value is checked to be at least 16. For practical use, however, it should not be lower than 1000; even 2000 was found to be insufficient in some cases where trailing garbage such as HTML snippets followed the EOF marker.
If the system property SYSPROP_EOFLOOKUPRANGE is defined, its value is set on initialization but can be overwritten later.
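The interplay between the default, the system property and the setter can be sketched as follows. This is an illustration under stated assumptions, not the PDFBox source: the class name, the property key example.eofLookupRange and the default of 2048 are placeholders (this page gives neither the actual string value of SYSPROP_EOFLOOKUPRANGE nor the value of DEFAULT_TRAIL_BYTECOUNT), and silently ignoring values below 16 is an assumption about what "checked" means.

```java
// Illustrative sketch only; all names and constants here are hypothetical.
final class EofLookupRangeSketch {

    // Assumed default, standing in for DEFAULT_TRAIL_BYTECOUNT.
    static final int ASSUMED_DEFAULT_TRAIL_BYTECOUNT = 2048;

    // Minimum accepted value, as stated in the description above.
    static final int MIN_LOOKUP_RANGE = 16;

    // Placeholder key, standing in for the value of SYSPROP_EOFLOOKUPRANGE.
    static final String HYPOTHETICAL_PROPERTY = "example.eofLookupRange";

    private int readTrailBytes = ASSUMED_DEFAULT_TRAIL_BYTECOUNT;

    EofLookupRangeSketch() {
        // On initialization, a defined system property overrides the default.
        String configured = System.getProperty(HYPOTHETICAL_PROPERTY);
        if (configured != null) {
            try {
                setEOFLookupRange(Integer.parseInt(configured.trim()));
            } catch (NumberFormatException e) {
                // Not a number: keep the default in this sketch.
            }
        }
    }

    /** A later call overwrites the initial value; values below 16 are ignored. */
    void setEOFLookupRange(int byteCount) {
        if (byteCount >= MIN_LOOKUP_RANGE) {
            readTrailBytes = byteCount;
        }
    }

    int getEOFLookupRange() {
        return readTrailBytes;
    }
}
```

With this shape, a caller who expects trailing garbage after the EOF marker could start the JVM with the property set to a larger range, or call the setter before parsing.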
byteCount - number of trailing bytes

protected COSDictionary retrieveTrailer() throws java.io.IOException
java.io.IOException - if something went wrong

protected COSDictionary parseXref(long startXRefOffset) throws java.io.IOException
startXRefOffset - start offset of the first table
java.io.IOException - if something went wrong

protected final long getStartxrefOffset() throws java.io.IOException
Looks for and parses startxref. We first look for the last '%%EOF' marker within the last DEFAULT_TRAIL_BYTECOUNT bytes (or the range set via setEOFLookupRange(int)) and go back to find startxref.
java.io.IOException - If something went wrong.

protected int lastIndexOf(char[] pattern, byte[] buf, int endOff)
pattern - pattern to search for
buf - buffer to search pattern in
endOff - offset (exclusive) where lookup starts at
Returns: -1 if pattern could not be found

public boolean isLenient()

public void setLenient(boolean lenient)
lenient - try to handle malformed PDFs.

protected void parseDictObjects(COSDictionary dict, COSName... excludeObjects) throws java.io.IOException
dict - the COSObject from the parent pages.
excludeObjects - dictionary object reference entries with these names will not be parsed
java.io.IOException - if something went wrong

protected final COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) throws java.io.IOException
obj - object to be parsed (we only take object number and generation number for lookup start offset)
requireExistingNotCompressedObj - if true, object to be parsed must not be contained within compressed stream
java.io.IOException - If an IO error occurs.

protected COSBase parseObjectDynamically(long objNr, int objGenNr, boolean requireExistingNotCompressedObj) throws java.io.IOException
objNr - object number of object to be parsed
objGenNr - object generation number of object to be parsed
requireExistingNotCompressedObj - if true, the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within object stream (this is used to circumvent being stuck in a loop in a malicious PDF)
java.io.IOException - If an IO error occurs.

protected COSStream parseCOSStream(COSDictionary dic) throws java.io.IOException
dic - dictionary that goes with this stream.
java.io.IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.

protected final COSDictionary rebuildTrailer() throws java.io.IOException
java.io.IOException - if something went wrong

protected void checkPages(COSDictionary root)
root - the root dictionary of the pdf

protected boolean isCatalog(COSDictionary dictionary)
dictionary

protected boolean parsePDFHeader() throws java.io.IOException
java.io.IOException - if something went wrong

protected boolean parseFDFHeader() throws java.io.IOException
java.io.IOException - if something went wrong

protected boolean parseXrefTable(long startByteOffset) throws java.io.IOException
startByteOffset - the offset to start at
java.io.IOException - If an IO error occurs.

public COSDocument getDocument() throws java.io.IOException
java.io.IOException - If there is an error getting the document.

public PDEncryption getEncryption() throws java.io.IOException
java.io.IOException - If there is an error getting the document.

public AccessPermission getAccessPermission() throws java.io.IOException
java.io.IOException - If there is an error getting the document.

protected COSBase parseTrailerValuesDynamically(COSDictionary trailer) throws java.io.IOException
trailer - The trailer dictionary.
java.io.IOException - If an IO error occurs or if the root object is missing in the trailer dictionary.

Copyright © 2002–2018. All rights reserved.