How to Convert a PDF File to an Excel File Using Python

Converting PDF to Excel: Several online tools and websites can convert PDF files to Excel. Converting them with Python is often more convenient, because a few lines of code can be reused for any number of files. With a local library such as tabula-py, the files never leave your machine. The PDFTables API, in contrast, is a web service, so the pdftables_api module sends the PDF to PDFTables for conversion. In both cases, all that is required is to pass the PDF file path to a Python function.

In this article, let us discuss how to convert PDF files to Excel files using the PDFTables API and tabula-py.

Extract Data From Multiple PDF Files to Excel Using Python

Given a PDF file, the task is to convert the given PDF file to Excel in Python.

If you work with data, you have probably had, or will have, to deal with data saved in a PDF file. It is hard to copy a table from a PDF and paste it directly into Excel. In most cases, what we copy from a PDF file is plain text rather than a structured table. As a result, when we paste the data into Excel, we see a block of text crammed into a single cell.

Of course, we don’t want to manually copy and paste individual values into Excel. There is commercial software that permits PDF to Excel conversion, but it is expensive. If you’re prepared to learn a little Python, you can accomplish a reasonably good outcome with fewer than 10 lines of code.

Python Program to Convert PDF File to Excel File

Below are the ways to convert the given PDF file to an Excel file in Python:

Using pdftables_api
Using tabula-py

Method #1: Using pdftables_api

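The pdftables_api module is a Python client for the PDFTables web service, which converts PDF files to other formats such as Excel and XML. Because PDFTables is a web-based API, it can be called from any programming language; the file is sent to the service for conversion.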

Installation:

pip install git+https://github.com/pdftables/python-pdftables-api.git

Collecting git+https://github.com/pdftables/python-pdftables-api.git
Cloning https://github.com/pdftables/python-pdftables-api.git to /tmp/pip-req-build-qfdz6fq6
Running command git clone -q https://github.com/pdftables/python-pdftables-api.git /tmp/pip-req-build-qfdz6fq6
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from pdftables-api==1.1.0) (2.23.0)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->pdftables-api==1.1.0) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->pdftables-api==1.1.0) (2021.10.8)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->pdftables-api==1.1.0) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->pdftables-api==1.1.0) (2.10)
Building wheels for collected packages: pdftables-api
Building wheel for pdftables-api (setup.py) ... done
Created wheel for pdftables-api: filename=pdftables_api-1.1.0-py3-none-any.whl size=5879 sha256=ddeaa9d1b7e5e0fb16cd34564d1dfa50891be0cb33ec19b70afe5c90830842af
Stored in directory: /tmp/pip-ephem-wheel-cache-o0v_cktl/wheels/80/d5/88/7c51378c0b76213ee939fcb303019731948c2271fc8aab2330
Successfully built pdftables-api
Installing collected packages: pdftables-api
Successfully installed pdftables-api-1.1.0

After installing pdftables_api, we need an API key to access the service.

To get an API key, visit PDFTables.com and log in or sign up with your email address.

Copy the API key from https://pdftables.com/pdf-to-excel-api and save it; it is used in the code below.
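
Rather than pasting the key directly into a script, one option is to read it from an environment variable. The sketch below is only an illustration: the variable name PDFTABLES_API_KEY and the helper load_api_key() are assumptions made for this example and are not part of the pdftables_api module.

```python
import os


def load_api_key(env_var="PDFTABLES_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this example; the
    pdftables_api module itself does not look for it.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your PDFTables API key."
        )
    return key
```

The returned value can then be passed to the client, for example pdftables_api.Client(load_api_key()), so the key does not end up in shared code.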

1) Converting to Excel using the xlsx() function

Approach:

Import the pdftables_api module using the import keyword.
Pass your API key to the Client() function of the pdftables_api module and store the returned object in a variable. The key is sent to PDFTables with each conversion request.
Convert the given PDF to Excel by calling the xlsx() function on that object, passing the input PDF path and the output Excel file path as arguments.
End of the program.

Below is the Implementation:

# Import the pdftables_api module
import pdftables_api

# Create a client with your API key (replace this sample key with your own)
pdf_conversion = pdftables_api.Client('zufjqhsgxitu')

# Convert samplePdf.pdf to Excel: the first argument is the input PDF path,
# the second is the output Excel file path
pdf_conversion.xlsx("samplePdf.pdf", "resultExcel.xlsx")

Output:

Website	Name
Sheets Tips	Vikram
Sheets Tips	Akash
Sheets Tips	Vishal
Python-Programs	Pavan
Python-Programs	Dhoni
Python-Programs	Virat
BTechGeeks	Devilliers
BTechGeeks	Pant
PythonArray	Smith
PythonArray	Warner
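
To extract data from multiple PDF files, the same call can be repeated for every PDF in a folder. The following is a minimal sketch under stated assumptions: convert_folder() is a hypothetical helper written for this article, and its convert argument is any callable that takes an input path and an output path, such as the xlsx() function of the client object created above.

```python
from pathlib import Path


def convert_folder(pdf_dir, out_dir, convert):
    """Convert every .pdf file in pdf_dir and return the output paths.

    `convert` is any callable taking (input_path, output_path), for
    example the xlsx method of a pdftables_api.Client object.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() gives a stable, alphabetical processing order
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # Reuse the PDF file name, swapping the extension for .xlsx
        target = out_dir / (pdf_path.stem + ".xlsx")
        convert(str(pdf_path), str(target))
        written.append(target)
    return written
```

A call such as convert_folder("pdfs", "excel", pdf_conversion.xlsx) would then convert each file in turn; the folder names here are placeholders. Each file is sent as a separate request.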

2) Converting to XML using the xml() function

Approach:

Import the pdftables_api module using the import keyword.
Pass your API key to the Client() function of the pdftables_api module and store the returned object in a variable. The key is sent to PDFTables with each conversion request.
Convert the given PDF to XML by calling the xml() function on that object, passing the input PDF path and the output XML file path as arguments.
End of the program.

Below is the Implementation:

# Import the pdftables_api module
import pdftables_api

# Create a client with your API key (replace this sample key with your own)
pdf_conversion = pdftables_api.Client('zufjqhsgxitu')

# Convert samplePdf.pdf to XML: the first argument is the input PDF path,
# the second is the output XML file path
pdf_conversion.xml("samplePdf.pdf", "result.xml")

Output:

<document page-count="1">
<page number="1">
<table data-filename="file.pdf" data-page="1" data-table="1">
<tr>
<td>Website</td>
<td>Name</td>
</tr>
<tr>
<td>Sheets Tips</td>
<td>Vikram</td>
</tr>
<tr>
<td>Sheets Tips</td>
<td>Akash</td>
</tr>
<tr>
<td>Sheets Tips</td>
<td>Vishal</td>
</tr>
<tr>
<td>Python-Programs</td>
<td>Pavan</td>
</tr>
<tr>
<td>Python-Programs</td>
<td>Dhoni</td>
</tr>
<tr>
<td>Python-Programs</td>
<td>Virat</td>
</tr>
<tr>
<td>BTechGeeks</td>
<td>Devilliers</td>
</tr>
<tr>
<td>BTechGeeks</td>
<td>Pant</td>
</tr>
<tr>
<td>PythonArray</td>
<td>Smith</td>
</tr>
<tr>
<td>PythonArray</td>
<td>Warner</td>
</tr>
</table>
</page>
</document>
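
If the XML result needs further processing in Python, it can be read back with the standard library. The sketch below assumes the structure shown in the output above (table, tr and td elements); the function name xml_tables_to_rows() is made up for this example, and real output may contain elements that are not handled here.

```python
import xml.etree.ElementTree as ET


def xml_tables_to_rows(xml_text):
    """Return one list of rows per <table>; each row is a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("table"):
        rows = []
        for tr in table.iter("tr"):
            # An empty cell has text None, so fall back to an empty string
            rows.append([(td.text or "").strip() for td in tr.iter("td")])
        tables.append(rows)
    return tables


# A shortened copy of the output shown above
sample = """<document page-count="1">
<page number="1">
<table data-filename="file.pdf" data-page="1" data-table="1">
<tr><td>Website</td><td>Name</td></tr>
<tr><td>Sheets Tips</td><td>Vikram</td></tr>
</table>
</page>
</document>"""

rows = xml_tables_to_rows(sample)[0]
print(rows)  # [['Website', 'Name'], ['Sheets Tips', 'Vikram']]
```

For a real file, the text can be loaded first with open("result.xml", encoding="utf-8").read() and passed to the same function.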

Method #2: Using tabula-py

We will use tabula-py to convert the given PDF to an Excel file. tabula-py reads tables from a PDF into pandas DataFrames, which can then be written to Excel.

Installation:

pip install tabula-py

Output:

Collecting tabula-py
Downloading tabula_py-2.3.0-py3-none-any.whl (12.0 MB)
|████████████████████████████████| 12.0 MB 5.4 MB/s 
Collecting distro
Downloading distro-1.7.0-py3-none-any.whl (20 kB)
Requirement already satisfied: pandas>=0.25.3 in /usr/local/lib/python3.7/dist-packages (from tabula-py) (1.3.5)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from tabula-py) (1.21.6)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.25.3->tabula-py) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.25.3->tabula-py) (2022.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas>=0.25.3->tabula-py) (1.15.0)
Installing collected packages: distro, tabula-py
Successfully installed distro-1.7.0 tabula-py-2.3.0

Before we begin, we must install Java and add the Java installation path to the PATH environment variable, because tabula-py relies on Java to run.

Install Java.
Set the PATH environment variable to include the bin folder of the Java installation (for example, C:\Program Files\Java\jre1.8.0_251\bin).
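
Before running the examples, it may help to confirm that Java is actually reachable from Python. This is a small sketch using only the standard library; java_on_path() is a hypothetical helper name, and it only checks PATH, not whether the installed Java version is suitable.

```python
import shutil


def java_on_path(executable="java"):
    """Return the full path of the Java launcher, or None if it is not on PATH."""
    return shutil.which(executable)


location = java_on_path()
if location is None:
    print("Java was not found on PATH; tabula-py may not be able to run.")
else:
    print(f"Java found at: {location}")
```

If the message says Java was not found, recheck the PATH setting and open a new terminal so the change takes effect.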

1) Excel File Without Index

Approach:

Import the tabula module using the import keyword.
Pass the given PDF file path and the page number to read (pages=1) to the read_pdf() function of the tabula module. The function returns a list of DataFrames, one per detected table; take the first one with [0] and store it in a variable.
Convert the DataFrame to Excel using the to_excel() function, passing the output Excel file path and index=False so that the row index is not written.
End of the program.

Below is the Implementation:

# Import the tabula module
import tabula

# Read page 1 of the PDF. read_pdf() returns a list of DataFrames,
# one per detected table, so [0] selects the first table.
dataframe = tabula.read_pdf("samplePdf.pdf", pages=1)[0]

# Write the DataFrame to an Excel file.
# index=False leaves out the row index column.
dataframe.to_excel('resultExcel.xlsx', index=False)

2) Excel File With Index

Approach:

Import the tabula module using the import keyword.
Pass the given PDF file path and the page number to read (pages=1) to the read_pdf() function of the tabula module. The function returns a list of DataFrames, one per detected table; take the first one with [0] and store it in a variable.
Convert the DataFrame to Excel using the to_excel() function, passing the output Excel file path. The index argument is True by default, so the row index is written as the first column.
End of the program.

Below is the Implementation:

# Import the tabula module
import tabula

# Read page 1 of the PDF. read_pdf() returns a list of DataFrames,
# one per detected table, so [0] selects the first table.
dataframe = tabula.read_pdf("samplePdf.pdf", pages=1)[0]

# Write the DataFrame to an Excel file.
# index defaults to True, so the row index is written as the first column.
dataframe.to_excel('resultExcel.xlsx')
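
Both examples keep only the first table on page 1. When a PDF contains several tables, one possible extension is to read every page and write each table to its own worksheet. The following is a sketch, not a definitive implementation: sheet_names() and pdf_tables_to_workbook() are hypothetical helpers, the "Table" prefix is arbitrary, and writing .xlsx files assumes an Excel writer package such as openpyxl is installed alongside pandas.

```python
def sheet_names(count, prefix="Table"):
    """Build sheet names such as Table1, Table2, ... (the prefix is arbitrary)."""
    return [f"{prefix}{i}" for i in range(1, count + 1)]


def pdf_tables_to_workbook(pdf_path, excel_path):
    """Write every table tabula-py finds in pdf_path to its own sheet."""
    # Imported here so the helper above can be used without these packages.
    import pandas as pd
    import tabula

    # pages="all" asks tabula-py to scan every page; the result is a
    # list of DataFrames, one per detected table.
    tables = tabula.read_pdf(pdf_path, pages="all")
    if not tables:
        raise ValueError(f"No tables detected in {pdf_path}")

    # One ExcelWriter collects all sheets into a single workbook
    with pd.ExcelWriter(excel_path) as writer:
        for name, table in zip(sheet_names(len(tables)), tables):
            table.to_excel(writer, sheet_name=name, index=False)
    return len(tables)
```

For instance, pdf_tables_to_workbook("samplePdf.pdf", "allTables.xlsx") would return the number of tables written; the output file name is a placeholder.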

Now that you know how to convert PDF files to Excel files using Python, you can use the methods described here the next time you need to convert a PDF file to Excel.
