Portfolio | Automated AI ETL Pipeline
Data Engineering & Automation

Automated ETL Pipeline & AI-Powered Data Extraction

Designed and developed an end-to-end automated ETL pipeline to process unstructured student certificate data (Images & PDFs) into a structured relational database, featuring a custom Human-in-the-Loop (HITL) system for 100% accuracy.

Python Google Gemini AI Selenium Apps Script API

90%

Reduction in manual data entry time

100%

Data accuracy via HITL Quality Control

Scalable

Automated batch processing architecture

The Problem

  • Time-Consuming: Admin staff spent hours manually downloading files, reading details, and typing them into a web form.
  • Prone to Error: Inconsistent date formats, misspelled student names, and incorrect point allocations degraded database quality.
  • Unscalable: Bottlenecks occurred during peak submission periods, delaying student portfolio updates.

The Solution

  • Intelligent Extraction: Integrated Google Gemini LLM to automatically parse text, dates, and intent from unstructured images and PDFs.
  • Data Transformation: Utilized Python's difflib for fuzzy matching to perfectly map misspelled names to official database IDs.
  • Hybrid Automation: Built an innovative pause-and-review system via Google Sheets API, allowing human auditing before executing Selenium web injections.

System Architecture

Entering the Site & Extracting Data with AI

Phase 1 is where the system automatically logs into the portal, accesses the ticketing system, and gathers all necessary files (Drive links, PDFs, Images). It then uses Google Gemini AI to analyze the visual certificates and extract structured JSON data according to strict rules.

# Strict Prompt Engineering for structured output
prompt = """
Analyze this image carefully. You MUST return a strictly formatted JSON ARRAY.
Data Extraction Rules:
- 'category' should be 'STEAM Exposure' or 'Global Insight'.
- 'date': Format strictly as YYYY-MM-DD.
- 'time': Format strictly as HH:MM:SS.
"""

response = model.generate_content([prompt, Image.open(file)])

Quality Control (QC) Checkpoint

Phase 2 acts as a Quality Control checkpoint to ensure 100% data accuracy. The system pushes the AI-extracted data to a temporary Google Sheet, pausing the script to allow a human reviewer to audit and edit any incorrect fields before proceeding.

# Sending temporary data to QC Google Sheet
requests.post(QC_SHEET_URL, json={"action": "write", "data": pending_entries})

# Pausing the Python script for Human Review
input("\nšŸ› ļø PRESS [ENTER] HERE WHEN YOU ARE DONE EDITING IN GOOGLE SHEET... ")

# Pulling back the edited data from Google Sheet
response = requests.get(f"{QC_SHEET_URL}?action=read")
qc_data = response.json()
pending_entries = qc_data["data"]

Injecting Final Data into Web Forms

In Phase 3, the system automatically inputs the finalized data into the web portal. It uses Selenium to dynamically fill out the student name, event type, event description, exact event date/time, and the awarded event points.

# Automated Data Input to Website Database via Selenium
Select(driver.find_element(By.NAME, "event_type")).select_by_visible_text(entry['category_ai'])
driver.find_element(By.NAME, "event_description").send_keys(desc)
driver.find_element(By.NAME, "point").send_keys(str(entry['poin_ai']))

# Combine & Inject Datetime bypassing front-end restrictions
EVENT_DATE = f"{raw_date}T{raw_time}"
driver.execute_script(f"""
    var dateField = document.getElementById('date');
    dateField.value = '{EVENT_DATE}';
    dateField.dispatchEvent(new Event('change', {{ bubbles: true }}));
"""
)

Skills Demonstrated

Advanced Data Cleaning

Used Python Regex and Datetime parsing to handle highly unstructured anomalies outputted by AI, enforcing strict schema compliance.

Data Quality Assurance

Recognized AI hallucination risks and proactively architected a REST API bridge to Google Sheets for human-level auditing (HITL).

API Integration & Fault Tolerance

Engineered a robust API key rotation system and exponential backoff to handle HTTP 429/500 errors, ensuring batches never crash midway.

LLM Prompt Engineering

Constructed strict zero-shot prompts to guide multimodal models (Gemini) into parsing spatial layout documents into JSON schema.

Full Source Code

Built as a portfolio showcase. Technologies: Python, Gemini, Selenium, REST API.