Batch PDF Data Extraction
Originally published: 14/05/2019 10:03
Publication number: ELQ-92419-1

Extract text and data from multiple PDF files into structured tabular format based on start and end pattern matching.

Description
The bulk extraction tool compiles and consolidates multiple data points from many similarly structured PDF files in a single process. Application forms, bank statements and survey responses, for example, often arrive as many individual PDF files, each containing specific data that needs to be extracted.

The batch extraction works from one or more rules, each specifying the text that immediately precedes and follows the content to be extracted. Rule options include wildcards and line feeds. Results are laid out as a table, with one row per file (identified by file name) and one column per extraction rule.
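The rule-based matching described above can be illustrated with a short Python sketch. This is not the tool's VBA code. The `*` wildcard (matching any run of characters within one line), the use of `\n` for a line feed, and the names `rule_to_regex`, `extract_between` and `extract_table` are all assumptions made for illustration; the tool's actual rule syntax may differ.

```python
import re


def rule_to_regex(pattern: str) -> str:
    """Turn a rule into a regex fragment.

    Assumed syntax: '*' matches any run of characters on a single line;
    everything else, including line feeds, is matched literally.
    """
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return r"[^\n]*?".join(literal_parts)


def extract_between(text: str, start: str, end: str) -> str:
    """Return the text found between the start and end patterns, or ''."""
    regex = rule_to_regex(start) + r"(.*?)" + rule_to_regex(end)
    match = re.search(regex, text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def extract_table(documents: dict, rules: dict) -> list:
    """Build one row per file and one column per rule.

    documents maps a file name to its plain text; rules maps a column
    name to a (start, end) pattern pair.
    """
    rows = []
    for name, text in documents.items():
        row = {"file": name}
        for column, (start, end) in rules.items():
            row[column] = extract_between(text, start, end)
        rows.append(row)
    return rows


# Hypothetical sample data, not taken from the demo files.
sample = "Account No: 12345\nName: Jane Doe\nBalance: 100.50\n"
rules = {"account": ("Account*: ", "\n"), "name": ("Name: ", "\n")}
table = extract_table({"statement1.txt": sample}, rules)
```

The non-greedy capture keeps each extraction as short as possible, so the first occurrence of the end pattern after the start pattern closes the match.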

The extraction tool relies on a supplied executable file that converts each PDF into a plain text file. Both the executable and the extraction tool must be placed in the same directory as the PDF files to be processed. The tool scans that folder and processes every PDF file it finds.
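The folder scan and conversion step might look roughly like the following Python sketch. The converter's name and command-line arguments are not documented here, so `build_command` assumes a hypothetical `<converter> <input.pdf> <output.txt>` form; adjust it to whatever the supplied executable actually expects.

```python
import subprocess
from pathlib import Path


def find_pdfs(folder) -> list:
    """Return every PDF file directly inside the folder (case-insensitive)."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def build_command(converter, pdf: Path) -> list:
    """Assumed CLI shape: converter, input PDF, output text file."""
    return [str(converter), str(pdf), str(pdf.with_suffix(".txt"))]


def convert_folder(folder, converter) -> list:
    """Run the converter on each PDF and return the text file paths."""
    outputs = []
    for pdf in find_pdfs(folder):
        # check=True raises CalledProcessError if the converter fails.
        subprocess.run(build_command(converter, pdf), check=True)
        outputs.append(pdf.with_suffix(".txt"))
    return outputs
```

Writing each text file next to its PDF mirrors the same-directory arrangement described above.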

Additional options allow the generated text files to be retained and new results to be appended to existing ones, which supports iterative extraction routines.
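As a rough illustration of the append option, the sketch below adds new rows to an existing results file instead of overwriting it. It writes CSV rather than an Excel worksheet, and the function name `save_results` and its parameters are assumptions for illustration only.

```python
import csv
from pathlib import Path


def save_results(rows, out_path, columns, append=True) -> None:
    """Write result rows to a CSV file, appending to it when requested."""
    out = Path(out_path)
    # A header is needed unless we are appending to a non-empty file.
    has_data = out.exists() and out.stat().st_size > 0
    write_header = not (append and has_data)
    mode = "a" if append else "w"
    with out.open(mode, newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Calling it with `append=False` corresponds to clearing the previous output before writing a fresh run.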

If the results are not as expected, the output can be cleared so that the extraction rules can be modified and the process rerun. The VBA code is open for viewing and modification if required.

This Best Practice includes
1 Excel file, 1 converter executable, 6 demo testing files

Business Spreadsheets offers you this Best Practice for free!


Further information

Extract multiple data streams from similarly structured PDF files into a structured table

Use it when many PDF files require the same data to be extracted for subsequent use and analysis.

It is not suitable for scanned PDF files or PDF files that cannot be converted to plain text.

