KYC & Employee Onboarding Assistant

GUI tool to OCR PAN & Aadhaar, extract photo, build a branded onboarding PDF, stamp + merge policies, and email the bundle.

Overview

The KYC & Onboarding Assistant is a desktop Python GUI app that reads PAN and Aadhaar (image or PDF), extracts KYC details and the Aadhaar photo, prepares a branded onboarding PDF, and lets you select policy PDFs to stamp with the employee’s name and timestamp before merging them into a single bundle. The app can also email the stamped bundle via the Gmail API or SMTP.

Built with: Tkinter, PyMuPDF, OpenCV, Pillow, pytesseract, ReportLab

Key Features

  • OCR PAN & Aadhaar (PDF/JPG/PNG) to auto-populate Name, Father’s Name, DOB, Gender, PAN, Aadhaar, Address.
  • Extracts the Aadhaar photo automatically and supports a manual crop tool with optional 45×35 (h/w) ratio lock.
  • Generates a branded Onboarding PDF with org logo, footer stamp (org name + page numbers), and embedded photo.
  • Lets you select policy PDFs from a folder, then stamps each page with Policy name, Accepted by (employee name) and timestamp, and merges them.
  • Sends the stamped bundle via Gmail API (or SMTP) directly from the app.
  • Configurable defaults in kyc_config.yaml (org name, logo path, default policy folder, email provider settings).
  • Basic format checks for PAN and Aadhaar, and cleanups for address OCR artifacts.

Installation

  1. Python 3.9+ recommended.
  2. Install Python packages:
    pip install pillow pytesseract opencv-python reportlab PyPDF2 PyMuPDF pyyaml \
        google-api-python-client google-auth-httplib2 google-auth-oauthlib
  3. Install Tesseract OCR engine (required by pytesseract):
    • Windows: download an installer (e.g., the UB Mannheim build), run it, and add the install directory to your PATH.
    • macOS: brew install tesseract
    • Linux (Debian/Ubuntu): sudo apt-get install tesseract-ocr
  4. Clone / copy the project files to a folder where you have read/write permissions.
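
To confirm that step 3 worked before launching the GUI, a quick check along these lines can help. This is a minimal sketch, not part of the app: find_tesseract and tesseract_version are hypothetical helper names, and the only assumption is that the executable is called tesseract on every platform.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable: str) -> str:
    """Return the first line of the `tesseract --version` banner."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the banner to stderr rather than stdout, so accept either.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else "unknown"
```

Calling find_tesseract() and getting None back means the PATH step was missed. If Tesseract is installed somewhere that cannot be added to PATH, pytesseract also accepts an explicit location through pytesseract.pytesseract.tesseract_cmd.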

Configuration (YAML)

Create or edit kyc_config.yaml in the app directory. Example:

org_name: "Your Company Pvt Ltd"
logo_path: "assets/logo.png"        # optional
policy_folder: "policies"           # default folder opened in the UI
footer_org: "Your Company Pvt Ltd"
email:
  provider: gmail_api                # or "smtp"
  smtp:
    server: smtp.example.com
    port: 587
    username: user@example.com
    password: your_app_password
    use_starttls: true
  gmail_api:
    credentials_file: credentials.json   # from Google Cloud Console (OAuth)
    token_file: token.json               # generated on first run

The app writes default keys if the YAML is missing, so you can start with only the fields you need and fill the rest later.
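
One way to get this "fill in what is missing" behaviour is a recursive merge of user values over a defaults dictionary. The sketch below is an illustration under assumptions, not the app's actual loader: DEFAULT_CONFIG, merge_defaults, and load_config are hypothetical names, the default values are copied from the example above, and the sketch only merges in memory rather than writing keys back to disk.

```python
import copy
from pathlib import Path
from typing import Any, Dict

# Hypothetical defaults mirroring the example kyc_config.yaml above.
DEFAULT_CONFIG: Dict[str, Any] = {
    "org_name": "Your Company Pvt Ltd",
    "logo_path": "",
    "policy_folder": "policies",
    "footer_org": "Your Company Pvt Ltd",
    "email": {
        "provider": "gmail_api",
        "smtp": {
            "server": "",
            "port": 587,
            "username": "",
            "password": "",
            "use_starttls": True,
        },
        "gmail_api": {
            "credentials_file": "credentials.json",
            "token_file": "token.json",
        },
    },
}


def merge_defaults(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict: a copy of defaults, recursively overridden by user values."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_defaults(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path: str = "kyc_config.yaml") -> Dict[str, Any]:
    """Read the YAML file if it exists and fill any missing keys from the defaults."""
    import yaml  # PyYAML; imported lazily so merge_defaults works without it

    config_path = Path(path)
    user_values: Dict[str, Any] = {}
    if config_path.exists():
        with config_path.open(encoding="utf-8") as handle:
            user_values = yaml.safe_load(handle) or {}
    return merge_defaults(DEFAULT_CONFIG, user_values)
```

With this approach a file that contains only org_name and email.provider still yields a complete configuration, because every other key falls back to its default.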

Usage & Workflow

  1. Launch the app (e.g., python kyc_onboarding_gui_v6_patch_manual_crop_fix.py).
  2. In Files, browse for:
    • PAN (PDF or image)
    • Aadhaar FRONT (PDF or image)
    • Aadhaar BACK (PDF or image) — for address
    • Optionally set the Policy folder (or configure it in YAML)
  3. Click Extract & Populate.
    • Fields fill in automatically; you can edit them directly.
    • If the photo needs adjustment, click Crop Photo Manually.
  4. Click Save Onboarding PDF — produces a branded, single-page form with the photo.
  5. Click Select & Combine Policies to choose which policies to include; the app stamps and merges them into one PDF.
  6. Click Email Stamped Policies to send the merged file via configured provider.

OCR & Photo Extraction

OCR

  • Uses pytesseract (Tesseract engine) for text.
  • Regex-based extraction for PAN, Aadhaar, DOB, etc., plus heuristics for names and father’s name.
  • Address parser strips label text (e.g., “Address:”, “Postal Address:”), ignores non-address lines (DOB/Gender), and stops at the first 6-digit PIN when possible.
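
The address heuristics described above can be sketched roughly as follows. This is an approximation for illustration only: clean_address and the three regular expressions are hypothetical, and the app's real patterns may differ.

```python
import re

# Leading labels such as "Address:" or "Postal Address:" (hypothetical pattern).
LABEL_RE = re.compile(r"^\s*(?:postal\s+)?address\b\s*[:\-]?\s*", re.IGNORECASE)
# Lines that carry DOB or gender rather than address text (hypothetical pattern).
NON_ADDRESS_RE = re.compile(
    r"\b(?:dob|date of birth|year of birth|gender|male|female)\b", re.IGNORECASE
)
# A 6-digit PIN code that is not part of a longer run of digits.
PIN_RE = re.compile(r"(?<!\d)\d{6}(?!\d)")


def clean_address(ocr_text: str) -> str:
    """Strip labels, skip DOB/Gender lines, and stop at the first 6-digit PIN."""
    parts = []
    for raw_line in ocr_text.splitlines():
        line = LABEL_RE.sub("", raw_line).strip(" ,")
        if not line or NON_ADDRESS_RE.search(line):
            continue
        pin = PIN_RE.search(line)
        if pin:
            # Keep everything up to and including the PIN, then stop reading.
            parts.append(line[: pin.end()].strip(" ,"))
            break
        parts.append(line)
    return ", ".join(parts)
```

Stopping at the PIN keeps trailing card text (URLs, helpline numbers) out of the address field, at the cost of truncating the rare address that prints something meaningful after the PIN.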

Photo

  • Tries to detect and crop to the Aadhaar photo frame using OpenCV; falls back to a clean face-aware crop.
  • Manual Crop: drag a rectangle over the preview; keep the 45×35 ratio lock for Aadhaar-style portraits or uncheck it for freeform.
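
For the ratio lock, one plausible approach is to shrink the dragged rectangle to the largest 45:35 (height:width) box anchored at the point where the drag started. The sketch below assumes that behaviour; lock_ratio is a hypothetical function and the app's crop dialog may resolve the constraint differently.

```python
from typing import Tuple

RATIO_H, RATIO_W = 45, 35  # height : width, as in the 45×35 lock


def lock_ratio(
    x0: float, y0: float, x1: float, y1: float,
    ratio_h: int = RATIO_H, ratio_w: int = RATIO_W,
) -> Tuple[int, int, int, int]:
    """Fit the largest ratio_h:ratio_w box inside the drag, anchored at (x0, y0).

    Returns (left, top, right, bottom) in pixels.
    """
    dx, dy = x1 - x0, y1 - y0
    width, height = abs(dx), abs(dy)

    if width * ratio_h > height * ratio_w:
        # Drag is too wide for its height: height is the limit, so reduce width.
        width = height * ratio_w / ratio_h
    else:
        # Drag is too tall (or exact): width is the limit, so reduce height.
        height = width * ratio_h / ratio_w

    # Re-apply the drag direction so the box grows away from the anchor point.
    end_x = x0 + (width if dx >= 0 else -width)
    end_y = y0 + (height if dy >= 0 else -height)

    left, right = sorted((x0, end_x))
    top, bottom = sorted((y0, end_y))
    return round(left), round(top), round(right), round(bottom)
```

The returned tuple uses the same (left, upper, right, lower) order as Pillow's Image.crop, so it can be passed to that call directly once preview coordinates have been scaled to the full-size image.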

Onboarding PDF

  • Generated with ReportLab (A4 portrait).
  • Includes org logo (optional), title, populated KYC fields, and the cropped photo in a bounded box.
  • Footer stamp: Org name — Page 1 of 1.
  • Names are normalized to Title Case; PAN and Aadhaar formats are normalized for readability.

Policies: Select, Stamp & Merge

  • Choose a policy folder; the dialog lists PDFs and pre-checks common policy names (HR, IS, Social Media, NDA, Reimbursement).
  • Each policy is stamped on every page with:
    • Policy: inferred from filename or keywords
    • Accepted by: employee’s name and current timestamp
    • Footer: Org name — Page N of M
  • Stamped PDFs are merged into a single output you choose.
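
The text placed on each page can be assembled before any PDF library is involved. The following sketch shows one way to infer a policy name from a filename and build the two stamp strings. The function names, the " | " and " - " separators, and the timestamp format are all assumptions made for illustration.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Tuple


def infer_policy_name(pdf_path: str) -> str:
    """Derive a display name from the file name, keeping acronyms such as HR or NDA."""
    stem = Path(pdf_path).stem
    words = stem.replace("_", " ").replace("-", " ").split()
    return " ".join(word if word.isupper() else word.capitalize() for word in words)


def build_stamp(
    policy_path: str,
    employee_name: str,
    org_name: str,
    page_number: int,
    page_count: int,
    accepted_at: Optional[datetime] = None,
) -> Tuple[str, str]:
    """Return (header, footer) strings for one page of one policy."""
    accepted_at = accepted_at or datetime.now()
    header = (
        f"Policy: {infer_policy_name(policy_path)} | "
        f"Accepted by: {employee_name} on {accepted_at:%Y-%m-%d %H:%M:%S}"
    )
    footer = f"{org_name} - Page {page_number} of {page_count}"
    return header, footer
```

With PyMuPDF, each string would then be drawn with page.insert_text((x, y), text, ...), using a point tuple for the position as noted under Troubleshooting.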

Email Dispatch

The app can send the merged, stamped policy PDF via:

  • Gmail API (OAuth 2.0; uses credentials.json and stores token.json on first run)
  • SMTP (server, port, username, password, starttls) — set email.provider: smtp in YAML

For the Gmail API, create OAuth client credentials in Google Cloud Console (application type: Desktop app) and download credentials.json into the app folder.
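
For the SMTP route, the standard library is enough to build and send a message with the bundle attached. The sketch below is illustrative rather than the app's own sender: build_message and send_via_smtp are hypothetical names, and smtp_cfg is assumed to be the email.smtp mapping from kyc_config.yaml.

```python
import smtplib
from email.message import EmailMessage
from pathlib import Path


def build_message(
    sender: str, recipient: str, subject: str, body: str, pdf_path: str
) -> EmailMessage:
    """Create a plain-text email with the stamped bundle attached as a PDF."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)

    pdf = Path(pdf_path)
    message.add_attachment(
        pdf.read_bytes(), maintype="application", subtype="pdf", filename=pdf.name
    )
    return message


def send_via_smtp(message: EmailMessage, smtp_cfg: dict) -> None:
    """Send the message using the server, port, and credentials from the config."""
    with smtplib.SMTP(smtp_cfg["server"], smtp_cfg["port"], timeout=30) as client:
        if smtp_cfg.get("use_starttls", True):
            client.starttls()  # upgrade to TLS before sending credentials
        client.login(smtp_cfg["username"], smtp_cfg["password"])
        client.send_message(message)
```

The same EmailMessage can serve the Gmail API route as well: that API accepts the full message as a base64url-encoded string in the raw field of a send request, so only the transport step differs.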

Validation (PAN / Aadhaar)

  • PAN format: AAAAA9999A (regex-validated).
  • Aadhaar: 12 digits (displayed as XXXX XXXX XXXX).
  • Basic date parsing for DOB; you can edit fields if OCR misses something.
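
The two format checks can be expressed in a few lines. This is a minimal sketch of the rules stated above, with hypothetical function names; it checks shape only and does not confirm that a number was actually issued.

```python
import re
from typing import Optional

# AAAAA9999A: five letters, four digits, one letter.
PAN_RE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")


def normalize_pan(raw: str) -> Optional[str]:
    """Return the PAN in upper case without whitespace, or None if the shape is wrong."""
    candidate = re.sub(r"\s+", "", raw).upper()
    return candidate if PAN_RE.fullmatch(candidate) else None


def format_aadhaar(raw: str) -> Optional[str]:
    """Return the Aadhaar number as 'XXXX XXXX XXXX', or None unless it has 12 digits."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 12:
        return None
    return " ".join(digits[i : i + 4] for i in range(0, 12, 4))
```

Returning None rather than raising lets the GUI leave the field empty for manual entry when OCR produces something that does not fit the pattern.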

Supported Inputs & Outputs

Type                 | Formats                | Notes
PAN / Aadhaar source | PDF, JPG, JPEG, PNG    | PDFs are parsed with PyMuPDF; images with Pillow/OpenCV.
Photo                | Extracted from Aadhaar | Auto-crop + manual crop tool.
Onboarding form      | PDF (A4)               | Branded; photo embedded; footer with page numbers.
Policies bundle      | PDF                    | Stamped and merged from selected policies.

Troubleshooting

  • insert_text positional args error: ensure the footer stamping uses a point tuple: page.insert_text((x, y), text, ...).
  • OCR is noisy / wrong fields: verify Tesseract is installed and language packs are present. Try higher DPI or clearer scans.
  • Photo crop includes card edges: use Crop Photo Manually and drag tightly to the face frame; keep ratio lock on.
  • Gmail API fails: regenerate token.json by deleting it and re-running; confirm OAuth credentials are for a “Desktop app”.
  • SMTP fails: verify server/port/STARTTLS settings and app password.

Privacy & Security Notes

  • OCR, photo cropping, and PDF generation all run locally on your machine; the only file the app sends anywhere is the stamped policy bundle, and only when you choose to email it.
  • Review generated PDFs before sharing; redact if necessary.
  • If using email, secure secrets (SMTP passwords) and limit access to credentials.json/token.json.

Suggested Folder Structure

project/
├─ kyc_onboarding_gui_v6_patch_manual_crop_fix.py
├─ kyc_config.yaml
├─ assets/
│  └─ logo.png
├─ policies/
│  ├─ HR Policy.pdf
│  ├─ IS Policy.pdf
│  ├─ Social Media Policy.pdf
│  ├─ NDA Policy.pdf
│  └─ Reimbursement Policy.pdf
└─ samples/   # optional test files

Known Limitations

  • OCR quality depends on scan clarity; skewed/low-resolution cards reduce accuracy.
  • Photo frame detection may vary across Aadhaar template versions; manual crop is provided as a fallback.
  • Address parsing relies on heuristics; please review the final address field.

Changelog (highlights)

  • v6: YAML config, branding & footer, stamping/merging, Gmail API sender.
  • v6 patch: better address cleanup, improved photo border detection.
  • v6 + manual crop: added interactive crop dialog (ratio lock), kept auto-crop.
  • stamp fix: corrected PyMuPDF insert_text call to use point tuple for footer.

License

Proprietary / Internal use. Update this section with your organization’s license or usage terms.