Automated GSTR-3B PDF to Excel Extractor for Monthly Turnover Analysis
AI & Accounting

Automated GSTR-3B PDF to Excel Extractor for Monthly Turnover Analysis

Author : CA Priya Seht

Watch on Youtube

Use Case Title :Automated GSTR-3B PDF to Excel Extractor for Monthly Turnover Analysis


For the (stock) audit report which requires monthly sales summary for a financial year, every time, the user opens the GSTR 3B of each month, manually adds up the figure of outward supplies, copies/writes them in excel/on paper respectively, sorts them month wise and then punches the final figures in the report.

This tool automates extraction of GSTR-3B data (Table 3.1 tables a,b,c and e i.e outward supplies as reported in the GSTR-3B) from PDFs using OCR and text parsing, converts it all into structured data in MS Excel, sorts it as per the Financial Year (i.e April to March chronologically) and then provides ready-to-analyse results for the year to the user. The tool also converts the taxable figures so extracted into Lakhs.

The user gets multiple advantages by using this tool:-

a) The time lost due in opening each PDF, noting down the figures and then converting it in Lakhs is saved

b) Manual errors are eliminated

c) The tool is packaged as an EXE and can be used by anyone without any technical knowledge

d) Works offline, thus data privacy is maintained


Prerequisites:

• Python installed and configured

• Tesseract OCR installed with correct path mapping

• Poppler installed and path configured

• Basic folder structure ready with input PDFs

• Required Python libraries installed using pip

  1. MS Excel

The following prompts were used with Chatgpt to create this tool:-


  1. I want to create a software that can run on my computer wherein it acts like an OCR

reader and when I feed in the PDFs in it , it gives me the figures. For eg: When I run that

software , all the GSTR3B in a folder are read and the output is given in Excel stating

month-wise sale output taxable figure. It can be done using Python, please suggest.

• Tell me what all needs to be installed with the relevant working links to download

everything.

• The extraction tool executed but it is not extracting the data from the pdf. I want "total

taxable value" of "(a) Outward taxable supplies (other than zero rated, nil rated and

exempted)" plus (b) Outward taxable supplies (zero rated) plus (c ) Other outward

supplies (nil rated, exempted)

• I have poppler in downloads, python is installed in C Drive and tesseract-ocr-w64-

setup-5.5.0.20241111 also executed and saved in C Drive.

•I want to create an exe file of this extractor according to this path.

•Now, I want it all to be mentioned in chronological order of a Financial Year

starting From Apr, May,June,July till March.

• Now, I want you to ensure that the excel sheet so created is formatted properly with the

proper column width and the months are arranged from April onwards i.e April, May,

June, Jul, Aug, Sep, Oct,Nov,Dec,Jan,Feb and March. Further, add a last column wherein

the value is converted into lakhs with decimals Like 12345678 is converted as 123.46 .