Tax Audit Reporting Using Python
Author: CA. Vishnu Manickam Ganesan
Author: CA. Vishnu Manickam Ganesan
Problem Statement
Preparing Tax Audit Report involves manually extracting and verifying data for clauses like:
• Clause 20(b) – ESI & PF payment details
• Clause 34(a) – TDS return reporting
These rely on data from challans, TDS text files, and ITR JSONs, which are:
• Time-consuming to process manually
• Prone to human errors
• Difficult to cross-check consistently across sources
Objective
To automate Clause 34(a) of the Tax Audit Report by:
Extract of desired output:
TOOLS AND LIBRARIES USED
TOOLS
Pyton, Anaconda Navigator, Jupyter Notebook, Excel
LIBRARIES
Pandas, OS, pdfplumber, re, openpyxl
WRONG APPROACH TO OUR OBJECTIVE:
To enter the complete prompt to get desired output all in one go.
The wrong prompt is
“i want you to read all the pdf files in this folder C:\Users\DELL\Downloads\global tds\global tds and extract information of date, quarter, form no, tax deductor details. and combine the information in reporting format as
Deductor | TAN | Form No | Quarter | Due Date | Date of Filing
Where due date is:
Q1 – 31st July,
Q2 – 31 October,
Q3 – 31 January,
Q4 – 31 May
Give me full pythoncode”
Result
We get only errors when running the python code, it is hard to identify the exact mistake and correct the prompt or the python code ending up in a never-ending loop of mistakes!
CORRECT APPROACH
STEP — BY — STEP
APPROACH
CHAT GPT PROMPT
Step 1) First Read one pdf and locate the position where the required data is available
Step 2) After locating the position of desired data train chat gpt that in any such pdf the data will be stored in exact same position and ask a python code to extract that data
“337169600037502 GLOBAL HEARING AID CENTRE PRIVATE LIMITED NA QVQYKDED”
but strip it to only name. And you can also find Date, TAN, AO Code, Form No, Periodicity, Type of Statement, Financial Year:
24 April 2024 CMBG05920F CHEW 6911 24Q Q4 Regular 2023-24
in which I need only Date | TAN | Periodicity | Name of the Deductor.
I want full python code to get this information”
Step 3) Now ask chat gpt to do the same process for all the pdfs in that folder and combine them to an excel file in reporting format as required by the statute
C:\Users\DELL\Downloads\global tds\global tds
with the following columns:
Name of Deductor | TAN | Form No | Quarter | Due Date | Date
To fill Quarter and Due Date:
Quarter - Periodicity Value
Due Date - If periodicity is:
Q1 – 31st July,
Q2 – 31st October,
Q3 – 31st January,
Q4 – 31st May”
Python code:The python code to get the above result is
import os
import pdfplumber
import pandas as pd
import re
def extract_info_from_pdf(pdf_path):
info = {
"Deductor": "",
"TAN": "",
"Form No": "",
"Quarter": "",
"Date of Filing": "",
"Due Date": ""
}
try:
with pdfplumber.open(pdf_path) as pdf:
full_text = "\n".join(
page.extract_text() for page in pdf.pages if page.extract_text()
)
except Exception as e:
print(f"Error reading {pdf_path}: {e}")
return info # Return empty info if file can't be read
# Regex patterns
patterns = {
"Deductor": r"Name of Deductor\s*:\s*(.*)",
"TAN": r"TAN\s*:\s*(\w+)",
"Form No": r"Form No\s*:\s*(\w+)",
"Quarter": r"Periodicity\s*:\s*(Q[1-4])",
"Date of Filing": r"Date\s*:\s*([0-9]{2}/[0-9]{2}/[0-9]{4})"
}
for key, pattern in patterns.items():
match = re.search(pattern, full_text)
if match:
info[key] = match.group(1).strip()
# Due dates by quarter
due_dates = {
"Q1": "31st July",
"Q2": "31st October",
"Q3": "31st January",
"Q4": "31st May"
}
if info["Quarter"] in due_dates:
info["Due Date"] = due_dates[info["Quarter"]]
return info
# Folder containing PDFs
folder_path = r"C:\Users\DELL\Downloads\global tds\global tds"
records = []
for filename in os.listdir(folder_path):
if filename.lower().endswith(".pdf"):
file_path = os.path.join(folder_path, filename)
data = extract_info_from_pdf(file_path)
records.append(data)
# Safe creation of DataFrame
df = pd.DataFrame([{
"Deductor": rec.get("Deductor", ""),
"TAN": rec.get("TAN", ""),
"Form No": rec.get("Form No", ""),
"Quarter": rec.get("Quarter", ""),
"Due Date": rec.get("Due Date", ""),
"Date of Filing": rec.get("Date of Filing", "")
} for rec in records])
# Display the output
print(df)
# Optional: Export to Excel
df.to_excel("TDS_Filing_Report.xlsx", index=False)
CONTINUATION TO GET FULL CLAUSE 34 REPORT OF FORM 3CD
In continuation to get other details such as section wise deductor, Tds deductions and amounts for which tds is deductible, Section etc
We read the .txt file generated while uploading the fuv file at the time of filing the tds return in which data will be stored in a structured format all we need to do is locate the data. For example if a particular line has value starting with ^DD^ it means deduction details in the same line 13th value will be tds amount, 21st value will be section etc. Can specified to train the chat gpt!
Later we get the complete code to generate the desired report!
Python Code
import pandas as pd
import os
import pdfplumber
import re
from collections import defaultdict
def process_tax_files(folder_path):
section_totals = defaultdict(lambda: {"tds_amount": 0.0, "deductible_amount": 0.0})
if not os.path.exists(folder_path):
print("Error: Folder does not exist!")
return pd.DataFrame(columns=["Deductor", "TAN", "Section", "Nature of Payment", "Total Payments of Specified Nature", "Sum Liable to TDS", "TDS"])
section_positions = {
"4JB": {"tds": 13, "deductible": 21, "new_section": "194J", "nature": "Professional Fee"},
"94Q": {"tds": 13, "deductible": 21, "new_section": "194Q", "nature": "Purchases"},
"4IB": {"tds": 13, "deductible": 21, "new_section": "194IB", "nature": "Rent"},
"94C": {"tds": 13, "deductible": 21, "new_section": "194C", "nature": "Contract"},
"92B": {"tds": 13, "deductible": 21, "new_section": "192B", "nature": "Salary"},
}
for filename in os.listdir(folder_path):
if filename.endswith(".txt"):
file_path = os.path.join(folder_path, filename)
with open(file_path, "r") as file:
for line in file:
parts = line.strip().split("^")
if len(parts) < 23:
continue
if parts[1] == "DD":
section_code = next((parts[i] for i in range(len(parts) - 1, 0, -1) if parts[i] in section_positions), None)
if not section_code:
continue
try:
tds_amount = float(parts[section_positions[section_code]["tds"]])
deductible_amount = float(parts[section_positions[section_code]["deductible"]])
except (ValueError, IndexError):
continue
section_totals[section_code]["tds_amount"] += tds_amount
section_totals[section_code]["deductible_amount"] += deductible_amount
df_results = pd.DataFrame.from_dict(section_totals, orient="index").reset_index()
df_results.rename(columns={"index": "Section", "tds_amount": "TDS", "deductible_amount": "Total Payments of Specified Nature"}, inplace=True)
df_results["Section"] = df_results["Section"].map(lambda x: section_positions[x]["new_section"])
df_results["Nature of Payment"] = df_results["Section"].map(lambda x: section_positions[x]["nature"])
df_results["Sum Liable to TDS"] = df_results["Total Payments of Specified Nature"]
df_results = df_results[["Deductor", "TAN", "Section", "Nature of Payment", "Total Payments of Specified Nature", "Sum Liable to TDS", "TDS"]]
return df_results
def extract_pdf_data(folder_path):
extracted_data = []
if not os.path.exists(folder_path):
print("Error: Folder does not exist!")
return pd.DataFrame(columns=["Deductor", "TAN", "Form No", "Quarter", "Due Date", "Date of Filing"])
due_date_map = {"Q1": "31 July", "Q2": "31 October", "Q3": "31 January", "Q4": "31 May"}
for filename in os.listdir(folder_path):
if filename.endswith(".pdf"):
file_path = os.path.join(folder_path, filename)
with pdfplumber.open(file_path) as pdf:
text = "\n".join([page.extract_text() for page in pdf.pages if page.extract_text()])
deductor, tan, form_no, quarter, date_of_filing = None, None, None, None, None
lines = text.split("\n")
for i, line in enumerate(lines):
if "Number Deductor/Collector be quoted on TDS" in line and i + 1 < len(lines):
deductor = re.sub(r'^\d+\s+', '', lines[i + 1].strip())
deductor = re.sub(r'\s+NA.*$', '', deductor)
elif "Date TAN AO Code Form No Periodicity Type of Statement Financial Year" in line and i + 1 < len(lines):
details = lines[i + 1].split()
if len(details) >= 7:
date_of_filing = f"{details[0]} {details[1]} {details[2]}"
tan = details[3]
form_no = details[5]
quarter = details[6]
extracted_data.append([deductor, tan, form_no, quarter, due_date_map.get(quarter, "Unknown"), date_of_filing])
df_results = pd.DataFrame(extracted_data, columns=["Deductor", "TAN", "Form No", "Quarter", "Due Date", "Date of Filing"])
return df_results
def save_to_excel(folder_path, output_file):
df1 = process_tax_files(folder_path)
df2 = extract_pdf_data(folder_path)
if not df2.empty:
df1["Deductor"] = df2.iloc[0, 0]
df1["TAN"] = df2.iloc[0, 1]
with pd.ExcelWriter(output_file, engine="openpyxl") as writer:
df1.to_excel(writer, sheet_name="TDS Sections", index=False)
df2.to_excel(writer, sheet_name="TDS Sections", index=False, startrow=len(df1) + 3)
if __name__ == "__main__":
folder_path = r"C:\\Users\\DELL\\Downloads\\global tds\\global tds"
output_file = "tds_combined_output.xlsx"
save_to_excel(folder_path, output_file)
print(f"Results saved to {output_file}")
CONCLUSION
In short, AI offers a powerful solution for streamlining tax audit reporting, delivering efficiency and accuracy gains for ICAI professionals.