Office Automation Hub: Bank Statement AnalyserRecord inserted or updated successfully.
AI & Auditing

Office Automation Hub: Bank Statement Analyser

Author: Vaibhav Khanna

Watch on Youtube

Office Automation Hub: Bank Statement Analyser

Table of Contents

  1. Overview, Goals, and Architecture Summary
  2. Domain Model and Glossary
  3. Capabilities and Module Intents
  4. Data Model and Contracts
  5. End-to-End Workflows and State Models
  6. Inter-module Interfaces and Integration Patterns
  7. NFRs: Performance, Reliability, Security, Compliance
  8. Configuration and Environment Matrix
  9. Install, Deploy, Operate (Runbooks)
  10. Testing Strategy and Coverage
  11. User Guides and Troubleshooting
  12. Governance, Versioning, and Release Notes
  13. Appendices

1. Overview, Goals, and Architecture Summary

1.1 Overview

The Bank Statement Analyser is a modular office automation tool designed to ingest, process, and categorize financial transactions from a standard bank statement format. It leverages a rule-based engine combined with optional AI-powered analysis to classify transactions, identify high-value items, and generate insightful summaries and visualizations. The system is designed to significantly reduce manual effort in accounting, auditing, and financial review processes.

1.2 Goals

  1. Automation: Automate the tedious process of manually categorizing bank statement entries.
  2. Standardization: Enforce a consistent categorization and tax treatment methodology using a centralized rules engine.
  3. Insight Generation: Provide high-level insights through automated summaries, dashboards, and high-value transaction reports.
  4. Efficiency: Accelerate workflows for accountants and auditors, such as ITR preparation, audit investigations, and suspense account reconciliation.
  5. Intelligence: Integrate modern AI capabilities (via Google Gemini) to provide intelligent suggestions for uncategorized "Suspense" transactions.
  6. Extensibility: Allow for client-specific rules and optional processing modules (e.g., Self-Improviser) to be added without altering the core logic.

1.3 Architecture Summary

The application follows a modular, three-tier architecture:

  1. Presentation Layer (Frontend): A web-based user interface (bank_analyzer_index.html, bank_analyzer_results.html) built with HTML, Tailwind CSS, and JavaScript. It allows users to upload files, set parameters, and view the analysis results, including interactive charts.
  2. Business Logic Layer (Backend): A Python-based processing engine, orchestrated by the main script bankv14.py. This layer contains the core logic for rule application, data manipulation, and coordination between various processing modules.
  3. Data Processing Modules (Libraries): A collection of specialized Python scripts, each responsible for a specific task:
  4. gemini_analyzer.py: Interfaces with the Google Gemini API for AI-based analysis.
  5. high_value_detector.py: Filters for high-value transactions.
  6. excel_dashboard_util.py: Generates graphical dashboards in the output Excel file.
  7. excel_styling.py: Applies professional formatting to Excel outputs.
  8. self_improviser.py: (Inferred) Analyzes suspense items to suggest new rules.

The system is designed to be run within a web server environment (e.g., Flask), where the frontend triggers the backend processing and renders the results.

2. Domain Model and Glossary

  1. Bank Statement: The primary input file (XLSX) containing transactional data with standard columns like Date, Transaction Remarks, Withdrawal Amount, and Deposit Amount.
  2. Rules File: An Excel workbook (Permanant_Bank.xlsx) that acts as the knowledge base for the system.
  3. Permanent Rules: A set of global categorization rules defined in the "Permanant" sheet of the Rules File, applicable to all clients.
  4. Client Rules: A set of client-specific rules defined in a dedicated sheet (named after the client) in the Rules File. These rules take precedence over Permanent Rules.
  5. Keyword: A specific text string within a rule used to match against the Transaction Remarks of a bank entry.
  6. Category: A high-level classification for a transaction (e.g., PERSONAL, BUSINESS, INVESTMENT). This is the primary output of the rule engine.
  7. Tax Treatment: A more granular classification related to Indian tax regulations (e.g., SALARY, CAPITAL GAINS, TELEPHONE EXP).
  8. Suspense: The default category assigned to any transaction that does not match any existing rule. These are the primary targets for review and AI analysis.
  9. High-Value Transaction (HVT): A transaction whose withdrawal or deposit amount exceeds a user-defined threshold.
  10. Self-Improviser: An optional module that analyzes recurring Suspense transactions to recommend new, permanent rules, thus improving the system's accuracy over time.

3. Capabilities and Module Intents

  1. bankv14.py (Core Orchestrator):
  2. Intent: To manage the entire analysis workflow from start to finish.
  3. Capabilities: Loads data, sequences the execution of processing modules, handles errors, aggregates results, and prepares the final output for both the Excel file and the web UI.
  4. Rule-Based Categorization Engine (within bankv14.py):
  5. Intent: To apply business logic for transaction classification.
  6. Capabilities: Reads Permanent and Client rules, iterates through transactions, and applies category/tax treatment based on keyword matching. Implements a client-over-permanent priority system.
  7. gemini_analyzer.py (AI Analysis Module):
  8. Intent: To provide intelligent, context-aware suggestions for ambiguous transactions.
  9. Capabilities: Securely connects to the Google Gemini API, constructs a precise prompt for financial analysis, parses the structured JSON response, and handles API-specific errors like rate limiting with an exponential backoff strategy.
  10. high_value_detector.py (HVT Module):
  11. Intent: To flag transactions that require special attention due to their monetary value.
  12. Capabilities: A configurable class that filters a DataFrame based on user-provided withdrawal and deposit thresholds.
  13. excel_dashboard_util.py (Reporting Module):
  14. Intent: To create a user-friendly, visual summary of the analysis.
  15. Capabilities: Uses openpyxl to dynamically generate a "Dashboard" sheet within the output Excel file, containing a monthly trend line chart, a category-wise pie chart, and a tax-wise bar chart.
  16. excel_styling.py (Formatting Module):
  17. Intent: To ensure the final Excel output is professional, readable, and consistently formatted.
  18. Capabilities: Applies styles (fonts, colors, borders, alignment) and adjusts column widths for all data tables in the output file.

4. Data Model and Contracts

4.1 Input Data Contracts

  1. Bank Statement (bank_file):
  2. Format: .xlsx
  3. Required Columns: Transaction Remarks, Withdrawal Amount (INR ), Deposit Amount (INR ), and a valid date column (e.g., Date, TXN Date). The column names must be exact.
  4. Rules File (rules_file):
  5. Format: .xlsx
  6. Sheet "Permanant":
  7. Columns: Keyword, Category, Tax Treatment
  8. Sheet [client_name]:
  9. Columns: Keyword, Category, Tax Treatment

4.2 Output Data Contracts

  1. Analyzed Excel File ([client_name]_analyzed_[timestamp].xlsx):
  2. Format: .xlsx
  3. Sheet "Dashboard": Contains charts for visual analysis.
  4. Sheet "Analyzed_Statement": The original bank statement with two new columns: Category and Tax Treatment.
  5. Sheet "Category_Summary": Aggregated sums of withdrawals and deposits, grouped by Category.
  6. Sheet "Tax_Summary": Aggregated sums of withdrawals and deposits, grouped by Tax Treatment.
  7. Sheet "Suspense": A subset of "Analyzed_Statement" containing only transactions categorized as Suspense. If the Gemini Analyzer is run, this sheet includes additional columns: AI Suggestive Category, Justification, Category (AI), Tax Treatment (AI).
  8. Sheet "High_Value_Transactions": (Optional) A list of high-value suspense transactions.
  9. Web UI Data (JSON):
  10. The backend returns a JSON object to the bank_analyzer_results.html template containing keys like category_json, tax_json, detailed_trend_json, and full_transactions_json to populate the interactive charts and modal dialogues.

5. End-to-End Workflows and State Models

Workflow: Standard Analysis

  1. START: User navigates to bank_analyzer_index.html.
  2. User Input: User selects a client, enters the client name for rules, uploads the bank statement, and clicks "Process Statement".
  3. Request: The browser POSTs the form data to the web server.
  4. Orchestration: The server calls the process_bank_statement function in bankv14.py.
  5. Processing:
  6. The function loads the rules and the bank statement.
  7. It applies the rule-based categorization engine.
  8. It generates summaries (Category, Tax).
  9. It writes all dataframes to a new Excel file.
  10. It applies styling (excel_styling.py).
  11. It generates the dashboard (excel_dashboard_util.py).
  12. It prepares JSON data for the UI.
  13. Response: The function returns a dictionary with the status, file paths, and JSON data.
  14. Render: The server renders the bank_analyzer_results.html template, passing the result dictionary.
  15. Display: The user sees a results page with interactive charts and download links.
  16. END.

State Model: Transaction

A transaction can be in one of two states:

  1. Uncategorized (Initial State): Default state upon ingestion.
  2. Categorized (Terminal State): The state after a rule has been successfully matched and applied.

Workflow: AI-Enhanced Analysis

This follows the standard workflow, but with an additional step after rule-based categorization:

  1. 5a. AI Analysis: If the "Analyze Suspense with Gemini AI" checkbox is ticked, the orchestrator iterates through all remaining Suspense transactions and sends them to gemini_analyzer.py for processing. The AI's suggestions are appended to the "Suspense" data.

6. Inter-module Interfaces and Integration Patterns

  1. Orchestrator-Module Pattern: bankv14.py acts as the central orchestrator that calls other modules.
  2. bankv14.py -> gemini_analyzer.py:
  3. Interface: analyze_single_transaction(transaction: dict) -> dict
  4. Integration: bankv14.py calls this function in a loop for each suspense row. The function is synchronous and includes a time.sleep() to manage rate limits internally.
  5. bankv14.py -> high_value_detector.py:
  6. Interface: HighValueTransactionDetector(thresholds).process_transactions(df: DataFrame) -> DataFrame
  7. Integration: bankv14.py instantiates the detector class with user-defined thresholds and calls its processing method.
  8. bankv14.py -> excel_dashboard_util.py:
  9. Interface: create_excel_dashboard(file_path: str)
  10. Integration: Called after the main Excel file is created. It modifies the file in place by adding a new sheet.
  11. bankv14.py -> excel_styling.py:
  12. Interface: apply_excel_styling(workbook: Workbook, ...)
  13. Integration: Called after data is written to the Excel file but before it's saved, allowing it to apply formatting to the openpyxl workbook object directly.

7. NFRs: Performance, Reliability, Security, Compliance

  1. Performance:
  2. The primary bottleneck is the optional Gemini AI analysis due to the per-transaction API calls and the intentional delay (time.sleep).
  3. Standard processing is fast, limited mainly by Pandas DataFrame operations and file I/O. Performance will degrade linearly with the number of rows in the bank statement.
  4. Reliability:
  5. The core processing logic is wrapped in a global try...except block, ensuring that any single failure during processing is caught, logged, and reported to the user without crashing the server.
  6. The Gemini module includes its own retry logic for transient API errors (rate limiting), enhancing its robustness.
  7. Security:
  8. The Google Gemini API key is handled securely using environment variables (os.getenv("GOOGLE_API_KEY")), preventing it from being hardcoded in the source.
  9. File uploads should be handled by the web server with appropriate security measures (e.g., file type validation, size limits, storing in a non-executable directory).
  10. Compliance:
  11. The tool processes financial data, so data privacy and handling are critical. All processing is done server-side, and the data should be managed according to the relevant privacy policies (e.g., GDPR, local data protection laws).

8. Configuration and Environment Matrix

  1. Runtime Environment: Python 3.8+.
  2. Key Libraries: pandas, openpyxl, google-generativeai.
  3. Environment Variables:
  4. GOOGLE_API_KEY: Required for the Gemini AI analysis module. The application will function without it, but the AI feature will be disabled.
  5. File-based Configuration:
  6. rules_excel_path_config: The file path to the Permanant_Bank.xlsx rules file.
  7. output_dir_config: The directory where analyzed output files are saved.
  8. log_dir_config: The directory where log files are stored.

9. Install, Deploy, Operate (Runbooks)

Installation

  1. Clone the repository.
  2. Install Python dependencies: pip install pandas openpyxl google-generativeai
  3. Set the environment variable GOOGLE_API_KEY with your API key.

Deployment

  1. Integrate the bankv14.py script into a web framework like Flask.
  2. Create a route (e.g., /process_bank_statement) that accepts POST requests from the bank_analyzer_index.html form.
  3. Configure the paths for rules, outputs, and logs within the web application's configuration.
  4. Serve the HTML files (index and results) from the appropriate routes.

Operation

  1. Ensure the web server is running.
  2. Ensure the Permanant_Bank.xlsx file is accessible at the configured path.
  3. Monitor the log_dir_config directory for logs to diagnose any processing failures.

10. Testing Strategy and Coverage

  1. Unit Testing: Each Python module should have dedicated unit tests.
  2. test_high_value_detector.py: Test with various thresholds, including zero and none.
  3. test_gemini_analyzer.py: Mock the genai API call to test the prompt construction and JSON parsing logic. Test the retry mechanism.
  4. Test the core rule engine with sample data to ensure client rules override permanent rules and that keywords are matched correctly.
  5. Integration Testing: Test the end-to-end workflow by simulating a web request with sample input files and verifying the contents of the generated Excel file.
  6. User Acceptance Testing (UAT): Users should test the web interface, uploading various real-world bank statements and rules files to validate the accuracy of the output and the usability of the interface.

11. User Guides and Troubleshooting

User Guide

  1. Navigate to the main page.
  2. Download Templates: First, download the "Rules Template" and "Bank Statement Format" to understand the required data structure.
  3. Prepare Files:
  4. Populate the Permanant_Bank.xlsx file with your rules. For specific clients, add new sheets with the client's name.
  5. Ensure your bank statement Excel file matches the column headers from the downloaded format.
  6. Upload Rules: Use the "Upload Rules" form to upload your modified Permanant_Bank.xlsx.
  7. Process Statement:
  8. In the "Processing Engine" form, select the client's official name for the report header.
  9. Enter the client's short name that exactly matches the sheet name in your rules file.
  10. Upload the client's bank statement.
  11. (Optional) Set high-value thresholds and enable the AI Analyzer or Self-Improviser.
  12. Click "Process Statement".
  13. Review Results: The results page will appear with charts and download links for the detailed Excel analysis and log file.

Troubleshooting

  1. Error: "Rules file not found": Check the server configuration to ensure the path to Permanant_Bank.xlsx is correct.
  2. Incorrect Categorization:
  3. Verify the client_name_for_rules input exactly matches the sheet name in the rules file.
  4. Check for typos or extra spaces in your keywords.
  5. AI Analyzer Not Working: Ensure the GOOGLE_API_KEY environment variable is correctly set in the server environment where the script is running.
  6. Charts Not Appearing: Make sure the date column in the bank statement is correctly formatted and has a recognized header (Date, TXN Date, etc.).

12. Governance, Versioning, and Release Notes

  1. Versioning: The system should follow Semantic Versioning (e.g., v1.4.0).
  2. Governance: Changes to the core logic or data contracts must be reviewed and documented. New rules can be added to the Permanant_Bank.xlsx file by authorized users without requiring a code change.
  3. Release Notes (v1.4):
  4. Feature: Integrated Gemini AI for intelligent analysis of suspense transactions.
  5. Feature: Added a web-based results page with interactive charts and a transaction detail modal.
  6. Improvement: Refactored the core script for better error handling and logging.
  7. Improvement: Added an automated Excel dashboard generation module.

13. Appendices

  1. Appendix A: API Schema Reference
  2. Gemini API Request (Prompt):
  3. You are an expert accounting assistant...
  4. Analyze the following bank transaction...
  5. Provide your answer as a single, valid, minified JSON object with 4 keys: "AI Suggestive Category", "Justification", "Category", and "Tax treatment"...
  6. --- TRANSACTION TO ANALYZE ---
  7. Type: {transaction['type']}, Narration: "{transaction['narration']}"
  8. Gemini API Response (Expected JSON):
  9. {
  10.  "AI Suggestive Category": "Example Fuel Expense",
  11.  "Justification": "The narration contains the word 'PETRO', indicating a fuel purchase.",
  12.  "Category": "BUSINESS",
  13.  "Tax treatment": "TRANSPORT"
  14. }