Tax Audit Reporting Using PythonRecord inserted or updated successfully.
AI & Data Analytics

Tax Audit Reporting Using Python

Author: CA. Vishnu Manickam Ganesan

Problem Statement

Preparing Tax Audit Report involves manually extracting and verifying data for clauses like:

• Clause 20(b) – ESI & PF payment details

• Clause 34(a) – TDS return reporting

These rely on data from challans, TDS text files, and ITR JSONs, which are:

• Time-consuming to process manually

• Prone to human errors

• Difficult to cross-check consistently across sources

Objective

To automate Clause 34(a) of the Tax Audit Report by:

  1. Extracting TDS data directly from .pdf & .txt files (generated from Traces)
  2. Matching it with books or other records
  3. Generating a ready-to-use report for Clause 34(a)

Extract of desired output:


TOOLS AND LIBRARIES USED

TOOLS

Pyton, Anaconda Navigator, Jupyter Notebook, Excel

LIBRARIES

Pandas, OS, pdfplumber, re, openpyxl


WRONG APPROACH TO OUR OBJECTIVE:

To enter the complete prompt to get desired output all in one go.

The wrong prompt is

“i want you to read all the pdf files in this folder C:\Users\DELL\Downloads\global tds\global tds and extract information of date, quarter, form no, tax deductor details. and combine the information in reporting format as

Deductor | TAN | Form No | Quarter | Due Date | Date of Filing


Where due date is:

Q1 – 31st July,

Q2 – 31 October,

Q3 – 31 January,

Q4 – 31 May

Give me full pythoncode”

Result

We get only errors when running the python code, it is hard to identify the exact mistake and correct the prompt or the python code ending up in a never-ending loop of mistakes!

CORRECT APPROACH

STEP — BY — STEP

APPROACH

CHAT GPT PROMPT

Step 1) First Read one pdf and locate the position where the required data is available

  1. I want a python code to read the pdf file in this path C:\\Users\\DELL\\Downloads\\global tds\\global tds\\24Q Q4 ACK.pdf “

Step 2) After locating the position of desired data train chat gpt that in any such pdf the data will be stored in exact same position and ask a python code to extract that data

  1. “In the above output which I gave, if you check you can find in 8th line “Number Deductor/Collector be quoted on TDS”. In subsequent line, you can find “Name of deductor” in 13th line as

“337169600037502 GLOBAL HEARING AID CENTRE PRIVATE LIMITED NA QVQYKDED”

but strip it to only name. And you can also find Date, TAN, AO Code, Form No, Periodicity, Type of Statement, Financial Year:

24 April 2024 CMBG05920F CHEW 6911 24Q Q4 Regular 2023-24

in which I need only Date | TAN | Periodicity | Name of the Deductor.

I want full python code to get this information”

Step 3) Now ask chat gpt to do the same process for all the pdfs in that folder and combine them to an excel file in reporting format as required by the statute

  1. “Now I want a full python code to generate this report in Excel file as a report of all the PDFs including form no 26Q and 24Q in that folder

C:\Users\DELL\Downloads\global tds\global tds

with the following columns:

Name of Deductor | TAN | Form No | Quarter | Due Date | Date

To fill Quarter and Due Date:

Quarter - Periodicity Value

Due Date - If periodicity is:

Q1 – 31st July,

Q2 – 31st October,

Q3 – 31st January,

Q4 – 31st May”


Python code:The python code to get the above result is

import os

import pdfplumber

import pandas as pd

import re


def extract_info_from_pdf(pdf_path):

   info = {

"Deductor": "",

"TAN": "",

"Form No": "",

"Quarter": "",

"Date of Filing": "",

"Due Date": ""

   }


   try:

with pdfplumber.open(pdf_path) as pdf:

full_text = "\n".join(

page.extract_text() for page in pdf.pages if page.extract_text()

)

   except Exception as e:

print(f"Error reading {pdf_path}: {e}")

return info # Return empty info if file can't be read


   # Regex patterns

patterns = {

"Deductor": r"Name of Deductor\s*:\s*(.*)",

"TAN": r"TAN\s*:\s*(\w+)",

"Form No": r"Form No\s*:\s*(\w+)",

"Quarter": r"Periodicity\s*:\s*(Q[1-4])",

"Date of Filing": r"Date\s*:\s*([0-9]{2}/[0-9]{2}/[0-9]{4})"

   }


   for key, pattern in patterns.items():

match = re.search(pattern, full_text)

       if match:

info[key] = match.group(1).strip()


   # Due dates by quarter

due_dates = {

"Q1": "31st July",

"Q2": "31st October",

"Q3": "31st January",

"Q4": "31st May"

   }


   if info["Quarter"] in due_dates:

info["Due Date"] = due_dates[info["Quarter"]]


   return info


# Folder containing PDFs

folder_path = r"C:\Users\DELL\Downloads\global tds\global tds"


records = []

for filename in os.listdir(folder_path):

   if filename.lower().endswith(".pdf"):

file_path = os.path.join(folder_path, filename)

data = extract_info_from_pdf(file_path)

records.append(data)


# Safe creation of DataFrame

df = pd.DataFrame([{

"Deductor": rec.get("Deductor", ""),

"TAN": rec.get("TAN", ""),

"Form No": rec.get("Form No", ""),

"Quarter": rec.get("Quarter", ""),

"Due Date": rec.get("Due Date", ""),

"Date of Filing": rec.get("Date of Filing", "")

} for rec in records])


# Display the output

print(df)


# Optional: Export to Excel

df.to_excel("TDS_Filing_Report.xlsx", index=False)


CONTINUATION TO GET FULL CLAUSE 34 REPORT OF FORM 3CD

In continuation to get other details such as section wise deductor, Tds deductions and amounts for which tds is deductible, Section etc

We read the .txt file generated while uploading the fuv file at the time of filing the tds return in which data will be stored in a structured format all we need to do is locate the data. For example if a particular line has value starting with ^DD^ it means deduction details in the same line 13th value will be tds amount, 21st value will be section etc. Can specified to train the chat gpt!

Later we get the complete code to generate the desired report!

Python Code

import pandas as pd

import os

import pdfplumber

import re

from collections import defaultdict


def process_tax_files(folder_path):

   section_totals = defaultdict(lambda: {"tds_amount": 0.0, "deductible_amount": 0.0})


   if not os.path.exists(folder_path):

print("Error: Folder does not exist!")

       return pd.DataFrame(columns=["Deductor", "TAN", "Section", "Nature of Payment", "Total Payments of Specified Nature", "Sum Liable to TDS", "TDS"])

   

section_positions = {

"4JB": {"tds": 13, "deductible": 21, "new_section": "194J", "nature": "Professional Fee"},

"94Q": {"tds": 13, "deductible": 21, "new_section": "194Q", "nature": "Purchases"},

"4IB": {"tds": 13, "deductible": 21, "new_section": "194IB", "nature": "Rent"},

"94C": {"tds": 13, "deductible": 21, "new_section": "194C", "nature": "Contract"},

"92B": {"tds": 13, "deductible": 21, "new_section": "192B", "nature": "Salary"},

   }

   

   for filename in os.listdir(folder_path):

       if filename.endswith(".txt"): 

file_path = os.path.join(folder_path, filename)

           

           with open(file_path, "r") as file:

               for line in file:

parts = line.strip().split("^") 

if len(parts) < 23:

continue 


if parts[1] == "DD":

section_code = next((parts[i] for i in range(len(parts) - 1, 0, -1) if parts[i] in section_positions), None)

if not section_code:

                           continue


try:

                           tds_amount = float(parts[section_positions[section_code]["tds"]])

                           deductible_amount = float(parts[section_positions[section_code]["deductible"]])

except (ValueError, IndexError):

                           continue


section_totals[section_code]["tds_amount"] += tds_amount

section_totals[section_code]["deductible_amount"] += deductible_amount

   

   df_results = pd.DataFrame.from_dict(section_totals, orient="index").reset_index()

df_results.rename(columns={"index": "Section", "tds_amount": "TDS", "deductible_amount": "Total Payments of Specified Nature"}, inplace=True)

df_results["Section"] = df_results["Section"].map(lambda x: section_positions[x]["new_section"])

df_results["Nature of Payment"] = df_results["Section"].map(lambda x: section_positions[x]["nature"])

df_results["Sum Liable to TDS"] = df_results["Total Payments of Specified Nature"]

   df_results = df_results[["Deductor", "TAN", "Section", "Nature of Payment", "Total Payments of Specified Nature", "Sum Liable to TDS", "TDS"]]

   

   return df_results


def extract_pdf_data(folder_path):

   extracted_data = []

   if not os.path.exists(folder_path):

print("Error: Folder does not exist!")

       return pd.DataFrame(columns=["Deductor", "TAN", "Form No", "Quarter", "Due Date", "Date of Filing"])

   

   due_date_map = {"Q1": "31 July", "Q2": "31 October", "Q3": "31 January", "Q4": "31 May"}

   

   for filename in os.listdir(folder_path):

       if filename.endswith(".pdf"): 

file_path = os.path.join(folder_path, filename)

           

           with pdfplumber.open(file_path) as pdf:

text = "\n".join([page.extract_text() for page in pdf.pages if page.extract_text()])

deductor, tan, form_no, quarter, date_of_filing = None, None, None, None, None

               

lines = text.split("\n")

               for i, line in enumerate(lines):

if "Number Deductor/Collector be quoted on TDS" in line and i + 1 < len(lines):

deductor = re.sub(r'^\d+\s+', '', lines[i + 1].strip())

deductor = re.sub(r'\s+NA.*$', '', deductor)

elif "Date TAN AO Code Form No Periodicity Type of Statement Financial Year" in line and i + 1 < len(lines):

details = lines[i + 1].split()

if len(details) >= 7:

                           date_of_filing = f"{details[0]} {details[1]} {details[2]}"

                           tan = details[3]

                           form_no = details[5]

                           quarter = details[6]

               

extracted_data.append([deductor, tan, form_no, quarter, due_date_map.get(quarter, "Unknown"), date_of_filing])

   

   df_results = pd.DataFrame(extracted_data, columns=["Deductor", "TAN", "Form No", "Quarter", "Due Date", "Date of Filing"])

   return df_results


def save_to_excel(folder_path, output_file):

   df1 = process_tax_files(folder_path)

   df2 = extract_pdf_data(folder_path)

   

   if not df2.empty:

df1["Deductor"] = df2.iloc[0, 0] 

df1["TAN"] = df2.iloc[0, 1]

   

   with pd.ExcelWriter(output_file, engine="openpyxl") as writer:

df1.to_excel(writer, sheet_name="TDS Sections", index=False)

df2.to_excel(writer, sheet_name="TDS Sections", index=False, startrow=len(df1) + 3) 


if __name__ == "__main__":

   folder_path = r"C:\\Users\\DELL\\Downloads\\global tds\\global tds" 

   output_file = "tds_combined_output.xlsx"

save_to_excel(folder_path, output_file)

print(f"Results saved to {output_file}")

CONCLUSION


In short, AI offers a powerful solution for streamlining tax audit reporting, delivering efficiency and accuracy gains for ICAI professionals.