org.apache.pdfbox.pdfparser
Class NonSequentialPDFParser

java.lang.Object
  extended by org.apache.pdfbox.pdfparser.BaseParser
      extended by org.apache.pdfbox.pdfparser.PDFParser
          extended by org.apache.pdfbox.pdfparser.NonSequentialPDFParser

public class NonSequentialPDFParser
extends PDFParser

A PDFParser that first reads startxref and the xref tables to learn which objects are valid, and then parses only those objects. It is therefore closer to a conforming parser than the sequential reading done by PDFParser. This class can be used as a PDFParser replacement. parse() must be called before the document or its page objects can be retrieved, e.g. via getPDDocument(). This class is a much enhanced version of the QuickParser presented in PDFBOX-1104 by Jeremy Villalobos.


Field Summary
protected static int DEFAULT_TRAIL_BYTECOUNT
           
protected static char[] EOF_MARKER
          EOF-marker.
protected static char[] OBJ_MARKER
          obj-marker.
protected  SecurityHandler securityHandler
          The security handler.
protected static char[] STARTXREF_MARKER
          StartXRef-marker.
static java.lang.String SYSPROP_EOFLOOKUPRANGE
           
static java.lang.String SYSPROP_PARSEMINIMAL
           
static java.lang.String TMP_FILE_PREFIX
           
 
Fields inherited from class org.apache.pdfbox.pdfparser.PDFParser
xrefTrailerResolver
 
Fields inherited from class org.apache.pdfbox.pdfparser.BaseParser
DEF, document, ENDOBJ, ENDSTREAM, FORCE_PARSING, forceParsing, pdfSource, PROP_PUSHBACK_SIZE
 
Constructor Summary
NonSequentialPDFParser(java.io.File file, RandomAccess raBuf)
          Constructs parser for given file using given buffer for temporary storage.
NonSequentialPDFParser(java.io.File file, RandomAccess raBuf, java.lang.String decryptionPassword)
          Constructs parser for given file using given buffer for temporary storage.
NonSequentialPDFParser(java.io.InputStream input)
          Constructs parser for given input stream.
NonSequentialPDFParser(java.io.InputStream input, RandomAccess raBuf, java.lang.String decryptionPassword)
          Constructs parser for given input stream using given buffer for temporary storage.
NonSequentialPDFParser(java.lang.String filename)
          Constructs parser for given file using memory buffer.
 
Method Summary
protected  void decrypt(COSString str, long objNr, long objGenNr)
          Decrypts given COSString.
protected  void deleteTempFile()
          Remove the temporary file.
 PDPage getPage(int pageNr)
          Returns the page requested with all the objects loaded into it.
 int getPageNumber()
          Returns the number of pages in a document.
 PDDocument getPDDocument()
          This will get the PD document that was parsed.
protected  java.io.File getPdfFile()
          Return the pdf file.
 SecurityHandler getSecurityHandler()
          Returns security handler of the document or null if document is not encrypted or parse() wasn't called before.
protected  long getStartxrefOffset()
          Looks for and parses startxref.
protected  void initialParse()
          The initial parse first parses only the trailer, the startxref and all xref tables in order to obtain a pointer (offset) to each of the PDF's objects.
protected  int lastIndexOf(char[] pattern, byte[] buf, int endOff)
          Searches last appearance of pattern within buffer.
 void parse()
          This will parse the stream and populate the COSDocument object.
protected  COSStream parseCOSStream(COSDictionary dic, RandomAccess file)
          This will read a COSStream from the input stream using length attribute within dictionary.
protected  COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj)
          This will parse the next object from the stream and add it to the local state.
protected  COSBase parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj)
          This will parse the next object from the stream and add it to the local state.
protected  void readPattern(char[] pattern)
          Reads given pattern from BaseParser.pdfSource.
protected  void releasePdfSourceInputStream()
          Enable handling of alternative pdfSource implementation.
 void setEOFLookupRange(int byteCount)
          Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker.
protected  void setPdfSource(long fileOffset)
          Sets BaseParser.pdfSource to start next parsing at given file offset.
 
Methods inherited from class org.apache.pdfbox.pdfparser.PDFParser
getDocument, getFDFDocument, isContinueOnError, parseStartXref, parseTrailer, parseXrefStream, parseXrefTable, setTempDirectory
 
Methods inherited from class org.apache.pdfbox.pdfparser.BaseParser
isClosing, isClosing, isEndOfName, isEOL, isEOL, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSString, parseDirObject, readExpectedString, readInt, readLine, readString, readString, setDocument, skipSpaces
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SYSPROP_PARSEMINIMAL

public static final java.lang.String SYSPROP_PARSEMINIMAL
See Also:
Constant Field Values

SYSPROP_EOFLOOKUPRANGE

public static final java.lang.String SYSPROP_EOFLOOKUPRANGE
See Also:
Constant Field Values

DEFAULT_TRAIL_BYTECOUNT

protected static final int DEFAULT_TRAIL_BYTECOUNT
See Also:
Constant Field Values

EOF_MARKER

protected static final char[] EOF_MARKER
EOF-marker.


STARTXREF_MARKER

protected static final char[] STARTXREF_MARKER
StartXRef-marker.


OBJ_MARKER

protected static final char[] OBJ_MARKER
obj-marker.


securityHandler

protected SecurityHandler securityHandler
The security handler.


TMP_FILE_PREFIX

public static final java.lang.String TMP_FILE_PREFIX
See Also:
Constant Field Values

Constructor Detail

NonSequentialPDFParser

public NonSequentialPDFParser(java.lang.String filename)
                       throws java.io.IOException
Constructs parser for given file using memory buffer.

Parameters:
filename - the filename of the pdf to be parsed
Throws:
java.io.IOException - If something went wrong.

NonSequentialPDFParser

public NonSequentialPDFParser(java.io.File file,
                              RandomAccess raBuf)
                       throws java.io.IOException
Constructs parser for given file using given buffer for temporary storage.

Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
Throws:
java.io.IOException - If something went wrong.

NonSequentialPDFParser

public NonSequentialPDFParser(java.io.File file,
                              RandomAccess raBuf,
                              java.lang.String decryptionPassword)
                       throws java.io.IOException
Constructs parser for given file using given buffer for temporary storage.

Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption
Throws:
java.io.IOException - If something went wrong.

NonSequentialPDFParser

public NonSequentialPDFParser(java.io.InputStream input)
                       throws java.io.IOException
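Constructs parser for given input stream. A temporary file is created when the parser is instantiated with an InputStream.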

Parameters:
input - input stream representing the pdf.
Throws:
java.io.IOException - If something went wrong.

NonSequentialPDFParser

public NonSequentialPDFParser(java.io.InputStream input,
                              RandomAccess raBuf,
                              java.lang.String decryptionPassword)
                       throws java.io.IOException
Constructs parser for given input stream using given buffer for temporary storage.

Parameters:
input - input stream representing the pdf.
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption.
Throws:
java.io.IOException - If something went wrong.

Method Detail

setEOFLookupRange

public void setEOFLookupRange(int byteCount)
Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.

If the system property SYSPROP_EOFLOOKUPRANGE is defined, its value is applied on initialization but can be overwritten later.

Parameters:
byteCount - number of trailing bytes
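
The precedence described above (default value, then system property, then explicit setter call) can be sketched roughly as follows. This is only an illustration in plain Java, not the parser's actual code: the property name, the default of 2048, the getter getEOFLookupRange() and the rule of ignoring non-positive values are all assumptions made for the example, since this page does not show the real constant values.

```java
final class EofLookupRangeSketch {

    // Hypothetical property name and default; the real constant values are not shown on this page.
    static final String SYSPROP_EOFLOOKUPRANGE = "example.pdfparser.eofLookupRange";
    static final int DEFAULT_TRAIL_BYTECOUNT = 2048;

    private int readTrailBytes = DEFAULT_TRAIL_BYTECOUNT;

    EofLookupRangeSketch() {
        // Apply the system property once, on initialization, if it is defined.
        String prop = System.getProperty(SYSPROP_EOFLOOKUPRANGE);
        if (prop != null) {
            try {
                setEOFLookupRange(Integer.parseInt(prop.trim()));
            } catch (NumberFormatException e) {
                // Malformed value: keep the default (a choice made for this sketch).
            }
        }
    }

    /** Overwrites the lookup range; non-positive values are ignored in this sketch. */
    void setEOFLookupRange(int byteCount) {
        if (byteCount > 0) {
            readTrailBytes = byteCount;
        }
    }

    /** Hypothetical getter, present only so the example can show the result. */
    int getEOFLookupRange() {
        return readTrailBytes;
    }

    public static void main(String[] args) {
        EofLookupRangeSketch parser = new EofLookupRangeSketch();
        System.out.println("initial range: " + parser.getEOFLookupRange());
        parser.setEOFLookupRange(8192); // a later call wins over the property
        System.out.println("after setter:  " + parser.getEOFLookupRange());
    }
}
```

Running the sketch with -Dexample.pdfparser.eofLookupRange=4096 would make the initial range come from the property instead of the default.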

initialParse

protected void initialParse()
                     throws java.io.IOException
The initial parse first parses only the trailer, the startxref and all xref tables in order to obtain a pointer (offset) to each of the PDF's objects. It can handle linearized PDFs, which have an xref at the end pointing to an xref at the beginning of the file. Finally, the root object is parsed.

Throws:
java.io.IOException - If something went wrong.

setPdfSource

protected final void setPdfSource(long fileOffset)
                           throws java.io.IOException
Sets BaseParser.pdfSource to start next parsing at given file offset.

Parameters:
fileOffset - file offset
Throws:
java.io.IOException - If something went wrong.

releasePdfSourceInputStream

protected final void releasePdfSourceInputStream()
                                          throws java.io.IOException
Enable handling of alternative pdfSource implementation.

Throws:
java.io.IOException - If something went wrong.

getStartxrefOffset

protected final long getStartxrefOffset()
                                 throws java.io.IOException
Looks for and parses startxref. We first look for the last '%%EOF' marker within the last DEFAULT_TRAIL_BYTECOUNT bytes (or the range set via setEOFLookupRange(int)) and then go back to find startxref.

Returns:
the offset of StartXref
Throws:
java.io.IOException - If something went wrong.
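
The lookup described above can be illustrated with a small standalone sketch. It is not the parser's implementation: the class and method names are made up for the example, the whole file is assumed to be available as a byte array, and String.lastIndexOf is used in place of the parser's own buffer search.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class StartxrefSketch {

    /**
     * Finds the last '%%EOF' within the trailing lookupRange bytes, goes back to
     * the preceding 'startxref' and returns the number written between the two.
     */
    static long findStartxrefOffset(byte[] pdf, int lookupRange) throws IOException {
        // Only the last lookupRange bytes of the file are inspected.
        int from = Math.max(0, pdf.length - lookupRange);
        // ISO-8859-1 maps every byte to exactly one char, so indices match byte offsets.
        String tail = new String(pdf, from, pdf.length - from, StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            throw new IOException("Missing end of file marker '%%EOF'");
        }
        // Search backwards from the EOF marker for the startxref keyword.
        int startxref = tail.lastIndexOf("startxref", eof);
        if (startxref < 0) {
            throw new IOException("Missing 'startxref' marker");
        }
        // The xref offset is the decimal number between the two markers.
        String number = tail.substring(startxref + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            throw new IOException("Invalid startxref offset: '" + number + "'", e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = "%PDF-1.4\n...objects and xref table...\nstartxref\n116\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findStartxrefOffset(pdf, 2048));
    }
}
```

If the lookup range is too small to contain both markers, the sketch fails with an IOException, which is the situation in which a larger range set via setEOFLookupRange(int) would help.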

lastIndexOf

protected int lastIndexOf(char[] pattern,
                          byte[] buf,
                          int endOff)
Searches for the last appearance of pattern within buffer. The lookup starts before endOff and goes back until 0.

Parameters:
pattern - pattern to search for
buf - buffer to search pattern in
endOff - offset (exclusive) where lookup starts at
Returns:
start offset of pattern within buffer or -1 if pattern could not be found
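
A backward search with this contract might look like the following sketch. It is an independent illustration, not the method's actual source; in particular it assumes that "exclusive" means the whole match must end before endOff, and that pattern characters are compared as single bytes.

```java
import java.nio.charset.StandardCharsets;

final class LastIndexOfSketch {

    /**
     * Returns the start offset of the last occurrence of pattern that lies
     * completely before endOff (exclusive), or -1 if there is none.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        // Highest start offset at which the pattern still ends before endOff.
        int start = Math.min(endOff, buf.length) - pattern.length;
        for (; start >= 0; start--) {
            int i = 0;
            while (i < pattern.length && buf[start + i] == (byte) pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start; // first hit when walking backwards is the last occurrence
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] tail = "startxref\n1234\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        int eof = lastIndexOf("%%EOF".toCharArray(), tail, tail.length);
        // Continue the search before the EOF marker to find startxref.
        int startxref = lastIndexOf("startxref".toCharArray(), tail, eof);
        System.out.println("EOF at " + eof + ", startxref at " + startxref);
    }
}
```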

readPattern

protected final void readPattern(char[] pattern)
                          throws java.io.IOException
Reads given pattern from BaseParser.pdfSource, skipping whitespace at start and end.

Parameters:
pattern - pattern to be skipped
Throws:
java.io.IOException - if pattern could not be read

parse

public void parse()
           throws java.io.IOException
This will parse the stream and populate the COSDocument object. This will close the stream when it is done parsing.

Overrides:
parse in class PDFParser
Throws:
java.io.IOException - If there is an error reading from the stream or corrupt data is found.

getPdfFile

protected java.io.File getPdfFile()
Return the pdf file.

Returns:
the pdf file

deleteTempFile

protected void deleteTempFile()
Remove the temporary file. A temporary file is created if this class is instantiated with an InputStream.


getSecurityHandler

public SecurityHandler getSecurityHandler()
Returns security handler of the document or null if document is not encrypted or parse() wasn't called before.

Returns:
the security handler.

getPDDocument

public PDDocument getPDDocument()
                         throws java.io.IOException
This will get the PD document that was parsed. When you are done with this document you must call close() on it to release resources. Overriding the super method was necessary in order to set the security handler.

Overrides:
getPDDocument in class PDFParser
Returns:
The document at the PD layer.
Throws:
java.io.IOException - If there is an error getting the document.

getPageNumber

public int getPageNumber()
                  throws java.io.IOException
Returns the number of pages in a document.

Returns:
the number of pages.
Throws:
java.io.IOException - if PAGES or other needed object is missing

getPage

public PDPage getPage(int pageNr)
               throws java.io.IOException
Returns the page requested with all the objects loaded into it.

Parameters:
pageNr - the page number, counted from 0.
Returns:
the page with the given pagenumber.
Throws:
java.io.IOException - If something went wrong.

parseObjectDynamically

protected final COSBase parseObjectDynamically(COSObject obj,
                                               boolean requireExistingNotCompressedObj)
                                        throws java.io.IOException
This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.

Parameters:
obj - object to be parsed (we only take object number and generation number for lookup start offset)
requireExistingNotCompressedObj - if true object to be parsed must not be contained within compressed stream
Returns:
the parsed object (which is also added to document object)
Throws:
java.io.IOException - If an IO error occurs.

parseObjectDynamically

protected COSBase parseObjectDynamically(int objNr,
                                         int objGenNr,
                                         boolean requireExistingNotCompressedObj)
                                  throws java.io.IOException
This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.

Parameters:
objNr - object number of object to be parsed
objGenNr - object generation number of object to be parsed
requireExistingNotCompressedObj - if true the object to be parsed must be defined in xref (note: null objects may be missing from xref) and it must not be a compressed object within an object stream (this is used to avoid getting stuck in a loop in a malicious PDF)
Returns:
the parsed object (which is also added to document object)
Throws:
java.io.IOException - If an IO error occurs.

decrypt

protected final void decrypt(COSString str,
                             long objNr,
                             long objGenNr)
                      throws java.io.IOException
Decrypts given COSString.

Parameters:
str - the string to be decrypted
objNr - the object number
objGenNr - the object generation number
Throws:
java.io.IOException - if something went wrong

parseCOSStream

protected COSStream parseCOSStream(COSDictionary dic,
                                   RandomAccess file)
                            throws java.io.IOException
This will read a COSStream from the input stream using the length attribute within the dictionary. If the length attribute is an indirect reference, it is first resolved to get the stream length. This means we copy stream data without testing for 'endstream' or 'endobj', and thus it is no problem if these keywords occur within the stream. We require 'endstream' to be found after the stream data is read.

Overrides:
parseCOSStream in class BaseParser
Parameters:
dic - dictionary that goes with this stream.
file - file to write the stream to when reading.
Returns:
parsed pdf stream.
Throws:
java.io.IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
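
The length-based approach can be shown with a standalone sketch. It is not the method's real implementation: it works on a plain byte array rather than on COSDictionary and RandomAccess, assumes the length has already been resolved to an int, and all names in it are invented for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LengthBasedStreamSketch {

    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /**
     * Copies exactly length bytes starting at dataStart without inspecting them,
     * then requires the 'endstream' keyword (after optional whitespace).
     */
    static byte[] readStreamData(byte[] src, int dataStart, int length) throws IOException {
        if (length < 0 || dataStart < 0 || dataStart > src.length - length) {
            throw new IOException("Stream too short for declared length " + length);
        }
        // The data is copied blindly, so keywords inside it are harmless.
        byte[] data = Arrays.copyOfRange(src, dataStart, dataStart + length);

        int pos = dataStart + length;
        while (pos < src.length && isWhitespace(src[pos])) {
            pos++;
        }
        for (int i = 0; i < ENDSTREAM.length; i++) {
            if (pos + i >= src.length || src[pos + i] != ENDSTREAM[i]) {
                throw new IOException("Expected 'endstream' after " + length + " bytes of stream data");
            }
        }
        return data;
    }

    private static boolean isWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n' || b == '\f' || b == 0;
    }

    public static void main(String[] args) throws IOException {
        // The stream data itself contains the word 'endstream'.
        byte[] src = "stream\nab endstream cd\nendstream\nendobj"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] data = readStreamData(src, 7, 15);
        System.out.println(new String(data, StandardCharsets.ISO_8859_1));
    }
}
```

A parser that scanned for the first 'endstream' keyword instead would cut this stream off after "ab ", which is the problem the length-based copy avoids. A wrong length, on the other hand, is detected because 'endstream' is not found where it is expected.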