Overview
The KYC & Onboarding Assistant is a desktop Python GUI app that reads PAN and Aadhaar (image or PDF), extracts KYC details and the Aadhaar photo, prepares a branded onboarding PDF, and lets you select policy PDFs to stamp with the employee’s name and timestamp before merging them into a single bundle. The app can also email the stamped bundle via the Gmail API or SMTP.
Built with: Tkinter, PyMuPDF, OpenCV, Pillow, pytesseract, ReportLab
Key Features
- OCR PAN & Aadhaar (PDF/JPG/PNG) to auto-populate Name, Father’s Name, DOB, Gender, PAN, Aadhaar, Address.
- Extracts the Aadhaar photo automatically and supports a manual crop tool with optional 45×35 (h/w) ratio lock.
- Generates a branded Onboarding PDF with org logo, footer stamp (org name + page numbers), and embedded photo.
- Lets you select policy PDFs from a folder, then stamps each page with Policy name, Accepted by (employee name) and timestamp, and merges them.
- Sends the stamped bundle via Gmail API (or SMTP) directly from the app.
- Configurable defaults in kyc_config.yaml (org name, logo path, default policy folder, email provider settings).
- Basic format checks for PAN and Aadhaar, and cleanup of address OCR artifacts.
Installation
- Python 3.9+ recommended.
- Install Python packages:
  pip install pillow pytesseract opencv-python reportlab PyPDF2 PyMuPDF pyyaml \
      google-api-python-client google-auth-httplib2 google-auth-oauthlib
- Install Tesseract OCR engine (required by pytesseract):
  - Windows: download the installer from the official repo (e.g., UB Mannheim build) and add the install path to your PATH.
  - macOS: brew install tesseract
  - Linux (Debian/Ubuntu): sudo apt-get install tesseract-ocr
- Clone / copy the project files to a folder where you have read/write permissions.
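Because pytesseract only wraps the Tesseract binary, it can help to confirm the engine is reachable before launching the GUI. The following is a minimal sketch using only the standard library; the helper name find_tesseract is hypothetical and not part of the app.

```python
import shutil
from typing import Optional


def find_tesseract(binary_name: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_tesseract()
    if path:
        print(f"Tesseract found at {path}")
    else:
        print("Tesseract not found; install it and add its folder to PATH")
```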
Configuration (YAML)
Create or edit kyc_config.yaml in the app directory. Example:
org_name: "Your Company Pvt Ltd"
logo_path: "assets/logo.png"   # optional
policy_folder: "policies"      # default folder opened in the UI
footer_org: "Your Company Pvt Ltd"
email:
  provider: gmail_api          # or "smtp"
  smtp:
    server: smtp.example.com
    port: 587
    username: user@example.com
    password: your_app_password
    use_starttls: true
  gmail_api:
    credentials_file: credentials.json   # from Google Cloud Console (OAuth)
    token_file: token.json               # generated on first run
The app writes default keys if the YAML is missing, so you can start with only the fields you need and fill the rest later.
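One way to get this "fill in whatever is missing" behaviour is a recursive merge of the user's YAML over a defaults dictionary. The sketch below is an assumption about how that could work, not the app's actual code: DEFAULT_CONFIG and merge_defaults are hypothetical names, and it assumes the smtp and gmail_api blocks nest under email.

```python
import copy

# Hypothetical defaults mirroring the example configuration; values are placeholders.
DEFAULT_CONFIG = {
    "org_name": "Your Company Pvt Ltd",
    "logo_path": "",
    "policy_folder": "policies",
    "footer_org": "Your Company Pvt Ltd",
    "email": {
        "provider": "gmail_api",
        "smtp": {
            "server": "",
            "port": 587,
            "username": "",
            "password": "",
            "use_starttls": True,
        },
        "gmail_api": {
            "credentials_file": "credentials.json",
            "token_file": "token.json",
        },
    },
}


def merge_defaults(user_cfg, defaults=DEFAULT_CONFIG):
    """Return a new dict: defaults overlaid with user values, recursing into nested mappings."""
    merged = copy.deepcopy(defaults)
    for key, value in (user_cfg or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged
```

With PyYAML installed, the parsed file can be passed straight in, for example merge_defaults(yaml.safe_load(text)); an empty or missing file (None) simply yields the defaults.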
Usage & Workflow
- Launch the app (e.g., python kyc_onboarding_gui_v6_patch_manual_crop_fix.py).
- In Files, browse for:
  - PAN (PDF or image)
  - Aadhaar FRONT (PDF or image)
  - Aadhaar BACK (PDF or image), used for the address
- Optionally set the Policy folder (or configure it in YAML)
- Click Extract & Populate.
- Fields fill in automatically; you can edit them directly.
- If the photo needs adjustment, click Crop Photo Manually.
- Click Save Onboarding PDF — produces a branded, single-page form with the photo.
- Click Select & Combine Policies to choose which policies to include; the app stamps and merges them into one PDF.
- Click Email Stamped Policies to send the merged file via configured provider.
OCR & Photo Extraction
OCR
- Uses pytesseract (Tesseract engine) for text.
- Regex-based extraction for PAN, Aadhaar, DOB, etc., plus heuristics for names and father’s name.
- Address parser strips label text (e.g., “Address:”, “Postal Address:”), ignores non-address lines (DOB/Gender), and stops at the first 6-digit PIN when possible.
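The address cleanup described above can be sketched roughly as follows. This is an illustrative approximation, not the app's parser: the function name clean_address, the exact label patterns, and the list of skipped keywords are all assumptions.

```python
import re

# Leading labels such as "Address:" or "Postal Address:".
LABEL_RE = re.compile(r"^\s*(?:postal\s+)?address\s*[:\-]?\s*", re.IGNORECASE)
# Lines that carry DOB/Gender rather than address text.
SKIP_RE = re.compile(r"\b(?:dob|date of birth|year of birth|gender|male|female)\b", re.IGNORECASE)
# A standalone 6-digit group is treated as the PIN code.
PIN_RE = re.compile(r"\b\d{6}\b")


def clean_address(ocr_text: str) -> str:
    """Join address lines, dropping labels and non-address lines, stopping at the first PIN."""
    parts = []
    for raw in ocr_text.splitlines():
        line = LABEL_RE.sub("", raw).strip(" ,")
        if not line or SKIP_RE.search(line):
            continue
        pin = PIN_RE.search(line)
        if pin:
            # Keep everything up to and including the PIN, then stop reading.
            parts.append(line[: pin.end()].strip(" ,"))
            break
        parts.append(line)
    return ", ".join(parts)
```

Stopping at the PIN keeps trailing noise (for example the Aadhaar number printed below the address) out of the field, at the cost of truncating anything legitimately printed after the PIN.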
Photo
- Tries to detect and crop to the Aadhaar photo frame using OpenCV; falls back to a clean face-aware crop.
- Manual Crop: drag a rectangle over the preview; keep the 45×35 ratio lock for Aadhaar-style portraits or uncheck it for freeform.
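A ratio lock of this kind usually derives the rectangle's height from the dragged width. The sketch below shows the arithmetic for a 45:35 height-to-width lock; lock_ratio is a hypothetical helper, and it does not clamp the result to the image bounds, which a real crop dialog would also need to do.

```python
def lock_ratio(x0, y0, x1, y1, ratio_h=45, ratio_w=35):
    """Return (left, top, right, bottom) anchored at (x0, y0) with height:width = ratio_h:ratio_w.

    (x0, y0) is where the drag started and (x1, y1) is the current pointer position.
    The dragged width is kept; the height is recomputed from the ratio.
    """
    width = abs(x1 - x0)
    height = width * ratio_h / ratio_w
    # Extend in the same direction the user is dragging.
    x_end = x0 + width if x1 >= x0 else x0 - width
    y_end = y0 + height if y1 >= y0 else y0 - height
    left, right = sorted((x0, x_end))
    top, bottom = sorted((y0, y_end))
    return (round(left), round(top), round(right), round(bottom))
```

For example, dragging from (10, 10) to (80, 200) yields a 70-pixel-wide, 90-pixel-tall box regardless of how far the pointer moved vertically.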
Onboarding PDF
- Generated with ReportLab (A4 portrait).
- Includes org logo (optional), title, populated KYC fields, and the cropped photo in a bounded box.
- Footer stamp: Org name — Page 1 of 1.
- Names are normalized to Title Case; PAN and Aadhaar formats are normalized for readability.
Policies: Select, Stamp & Merge
- Choose a policy folder; the dialog lists PDFs and pre-checks common policy names (HR, IS, Social Media, NDA, Reimbursement).
- Each policy is stamped on every page with:
- Policy: inferred from filename or keywords
- Accepted by: employee’s name and current timestamp
- Footer: Org name — Page N of M
- Stamped PDFs are merged into a single output you choose.
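The text of each stamp can be assembled independently of the PDF library. The following sketch shows one plausible way to infer the policy name from a filename and build the two stamp strings; KNOWN_POLICIES, infer_policy_name, stamp_lines, the timestamp format, and the separators are assumptions for illustration, not the app's actual implementation.

```python
import re
from datetime import datetime
from pathlib import Path

# Keywords based on the pre-checked policy names; longer phrases are checked first.
KNOWN_POLICIES = {
    "social media": "Social Media Policy",
    "reimbursement": "Reimbursement Policy",
    "nda": "NDA Policy",
    "hr": "HR Policy",
    "is": "IS Policy",
}


def infer_policy_name(filename: str) -> str:
    """Map a filename to a known policy label, falling back to the cleaned file stem."""
    stem = re.sub(r"[_\-]+", " ", Path(filename).stem).strip()
    lowered = stem.lower()
    for keyword, label in KNOWN_POLICIES.items():
        if re.search(rf"\b{re.escape(keyword)}\b", lowered):
            return label
    return stem


def stamp_lines(filename, employee, org, page, total, when):
    """Return (header, footer) strings for one page of one policy."""
    header = (
        f"Policy: {infer_policy_name(filename)} | "
        f"Accepted by: {employee} on {when:%Y-%m-%d %H:%M:%S}"
    )
    footer = f"{org} - Page {page} of {total}"
    return header, footer
```

The resulting strings would then be drawn on each page with PyMuPDF's page.insert_text((x, y), text, ...) before the stamped files are merged.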
Email Dispatch
The app can send the merged, stamped policy PDF via:
- Gmail API (OAuth 2.0; uses credentials.json and stores token.json on first run)
- SMTP (server, port, username, password, STARTTLS); set email.provider: smtp in YAML
For the Gmail API, create OAuth client credentials of type "Desktop app" in Google Cloud Console and download credentials.json into the app folder.
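Both providers can share the same message-building step. The sketch below uses only the Python standard library to build a message with the PDF attached, wrap it in the base64url "raw" payload that the Gmail API send call accepts, and send it over SMTP. The function names are hypothetical, and the OAuth flow that produces the Gmail service object is omitted.

```python
import base64
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body, pdf_bytes, pdf_name):
    """Build a plain-text email with one PDF attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    msg.add_attachment(pdf_bytes, maintype="application", subtype="pdf", filename=pdf_name)
    return msg


def to_gmail_payload(msg):
    """Encode the full message as base64url under the 'raw' key, as the Gmail API expects."""
    return {"raw": base64.urlsafe_b64encode(msg.as_bytes()).decode("ascii")}


def send_via_smtp(msg, server, port, username, password, use_starttls=True):
    """Send the message over SMTP, upgrading to TLS first when use_starttls is set."""
    with smtplib.SMTP(server, port) as smtp:
        if use_starttls:
            smtp.starttls()
        smtp.login(username, password)
        smtp.send_message(msg)
```

With an authorized Gmail service object from google-api-python-client, the payload would be sent with service.users().messages().send(userId="me", body=payload).execute().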
Validation (PAN / Aadhaar)
- PAN format: AAAAA9999A (regex-validated).
- Aadhaar: 12 digits (displayed as XXXX XXXX XXXX).
- Basic date parsing for DOB; you can edit fields if OCR misses something.
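These checks are simple enough to sketch with the standard library. The helper names and the accepted date formats below are assumptions; the checks are format-only and do not confirm that a PAN or Aadhaar number was actually issued.

```python
import re
from datetime import date, datetime
from typing import Optional

# Five letters, four digits, one letter (the AAAAA9999A pattern).
PAN_RE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")


def is_valid_pan(value: str) -> bool:
    """Check the PAN shape after trimming whitespace and upper-casing."""
    return bool(PAN_RE.fullmatch(value.strip().upper()))


def format_aadhaar(value: str) -> Optional[str]:
    """Return the number as 'XXXX XXXX XXXX', or None if it does not have exactly 12 digits."""
    digits = re.sub(r"\D", "", value)
    if len(digits) != 12:
        return None
    return " ".join(digits[i:i + 4] for i in range(0, 12, 4))


def parse_dob(value: str) -> Optional[date]:
    """Try a few common date layouts; return None when none of them fit."""
    for fmt in ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning None instead of raising lets the GUI leave a field blank for manual entry when OCR output does not parse.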
Supported Inputs & Outputs
| Type | Formats | Notes |
|---|---|---|
| PAN / Aadhaar source | PDF, JPG, JPEG, PNG | PDFs are parsed with PyMuPDF; images with Pillow/OpenCV. |
| Photo | Extracted from Aadhaar | Auto-crop + manual crop tool. |
| Onboarding form | PDF (A4) | Branded; photo embedded; footer with page numbers. |
| Policies bundle | PDF | Stamped and merged from selected policies. |
Troubleshooting
- insert_text positional args error: ensure the footer stamping uses a point tuple: page.insert_text((x, y), text, ...).
- OCR is noisy / wrong fields: verify Tesseract is installed and language packs are present. Try higher DPI or clearer scans.
- Photo crop includes card edges: use Crop Photo Manually and drag tightly to the face frame; keep ratio lock on.
- Gmail API fails: regenerate token.json by deleting it and re-running; confirm OAuth credentials are for a “Desktop app”.
- SMTP fails: verify server/port/STARTTLS settings and app password.
Privacy & Security Notes
- All OCR and PDF processing happens locally on your machine; the app does not upload PDFs or images, and the stamped bundle leaves your machine only when you choose to email it.
- Review generated PDFs before sharing; redact if necessary.
- If using email, secure secrets (SMTP passwords) and limit access to credentials.json / token.json.
Suggested Folder Structure
project/
├─ kyc_onboarding_gui_v6_patch_manual_crop_fix.py
├─ kyc_config.yaml
├─ assets/
│ └─ logo.png
├─ policies/
│ ├─ HR Policy.pdf
│ ├─ IS Policy.pdf
│ ├─ Social Media Policy.pdf
│ ├─ NDA Policy.pdf
│ └─ Reimbursement Policy.pdf
└─ samples/ # optional test files
Known Limitations
- OCR quality depends on scan clarity; skewed/low-resolution cards reduce accuracy.
- Photo frame detection may vary across Aadhaar template versions; manual crop is provided as a fallback.
- Address parsing relies on heuristics; please review the final address field.
Changelog (highlights)
- v6: YAML config, branding & footer, stamping/merging, Gmail API sender.
- v6 patch: better address cleanup, improved photo border detection.
- v6 + manual crop: added interactive crop dialog (ratio lock), kept auto-crop.
- stamp fix: corrected PyMuPDF insert_text call to use a point tuple for the footer.
License
Proprietary / Internal use. Update this section with your organization’s license or usage terms.