global pdfscanner
The pdfscanner
library allows interpretation of PDF content streams and
/ToUnicode
(cmap) streams. You can get those streams from the pdfe
library.
😱 Types incomplete or incorrect? 🙏 Please contribute!
methods
pdfscanner.scan
function pdfscanner.scan(
pdf: (string|PdfeStream|PdfeArray),
operatortable: Operatorable,
info: table
)
The first argument should be a Lua string or a stream or array object coming
from the pdfe
library. The second argument, operatortable
, should
be a Lua table where the keys are PDF operator name strings and the values
are Lua functions (defined by you) that are used to process those operators.
The functions are called whenever the scanner finds one of these PDF operators
in the content stream(s). The functions are called with two arguments: the scanner
object itself, and the info
table that was passed are the third
argument to pdfscanner.scan
.
Internally, pdfscanner.scan
loops over the PDF operators in the
stream(s), collecting operands on an internal stack until it finds a PDF
operator. If that PDF operator's name exists in operatortable
, then the
associated function is executed. After the function has run (or when there is no
function to execute) the internal operand stack is cleared in preparation for the
next operator, and processing continues.
The scanner
argument to the processing functions is needed because it
offers various methods to get the actual operands from the internal operand
stack.
local operatortable = { }
operatortable.Do = function(scanner,info)
local resources = info.resources
if resources then
local val = scanner:pop()
local name = val[2]
local xobject = resources.XObject
print(info.space .. "Uses XObject " .. name)
local resources = xobject.Resources
if resources then local newinfo = { space = info.space .. " ", resources = resources, } pdfscanner.scan(entry, operatortable, newinfo)
end
end
end
local function Analyze(filename)
local doc = pdfe.open(filename)
if doc then
local pages = doc.Pages
for i=1,#pages do local page = pages[i] local info = { space = " " , resources = page.Resources, } print("Page " .. i)
-- pdfscanner.scan(page.Contents,operatortable,info) pdfscanner.scan(page.Contents(),operatortable,info)
end
end
end
Analyze("foo.pdf")
Reference:
- Corresponding C source code: lpdfscannerlib.c#L680-L828