Portfolio Project · Public Kaggle Data · No Production Claims

AI-powered Support
Operations Optimization

A gDATA-style support-operations case study that combines public support data, ML/NLP, selective LLM fallback, and human-safe routing policies to automate ticket routing and turn model confidence into actionable operating recommendations.

8,325
Structured Ticket Rows
Deduplicated Kaggle dataset
1,665
Held-out Benchmark Tickets
Stratified 80/20 split
+21.1 pts
Macro-F1 Lift
Improved ML vs keyword baseline
12.1% → 33.2% vs keyword baseline
4-Stage
Routing Cascade
Rules → ML → LLM → Human
Sweep
Cost–Coverage Threshold
Policy operating curve
Business Context

The Business Challenge

Support teams face competing goals: automation coverage, human review load, LLM API cost, and routing quality. A principled system makes those tradeoffs explicit and measurable.

Reduce Manual Review Load

Manual triage of every incoming ticket is costly and doesn't scale. Automation needs to be safe enough to handle high-confidence cases without burdening agents.

Control LLM Invocation Cost

LLM-only routing invokes a language model for every ticket — expensive and hard to cost-control at scale. Smart routing limits LLM calls to truly ambiguous cases.

Preserve Routing Quality & Safety

Misrouted tickets harm customer trust and operations. Uncertain cases and LLM failures must always land in human triage — the system should never fail silently.

Methodology

From Data to Decision

A structured pipeline that goes from raw public data to an actionable routing policy with explicit, measurable tradeoffs.

01
Public Support Data
Two Kaggle datasets: Twitter customer messages and a structured ticket dataset with metadata fields.
02
Label Provenance
Ticket Type metadata columns map to issue labels. Labels are never derived from model predictions or keyword rules.
03
Supervised Benchmark
Stratified 80/20 holdout split. Keyword, ML baseline, and improved ML each evaluated on 1,665 unseen tickets.
04
Routing Cascade
4-stage system: deterministic rules, calibrated ML, LLM fallback, human triage as the safety net.
05
Threshold Policy
Confidence sweep generates a cost–coverage operating curve to inform threshold selection.
06
Decision Artifacts
Queue distributions, threshold guides, benchmark metrics, and routing KPIs exported for review.
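The benchmark step above (stratified 80/20 holdout, macro-F1 comparison) can be sketched as follows. This is a minimal illustration, not the project code: the toy corpus and model settings are assumptions, and the real pipeline trains on the Kaggle ticket text with metadata-derived labels.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the deduplicated ticket dataset;
# in the project, texts/labels come from the Ticket Type metadata column.
texts = ["refund not received", "cannot log in to account",
         "billing charged twice", "password reset fails",
         "invoice amount wrong", "login loop after update"] * 50
labels = ["billing", "technical", "billing",
          "technical", "billing", "technical"] * 50

# Stratified 80/20 holdout keeps class proportions identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Macro-F1 weights every class equally, so minority classes count fully.
macro_f1 = f1_score(y_test, model.predict(X_test), average="macro")
```

Evaluating all three systems (keyword rules, ML baseline, improved ML) on the same unseen split is what makes the +21.1 pt lift comparable.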
gDATA Alignment

Why this maps to Business Data Science

Designed to show the full loop from ambiguous support-ops problem to data-driven recommendation.

Business Problem Framing

Balances manual triage cost, automation coverage, LLM invocation cost, and routing quality.

Data Science Implementation

Uses public support data, metadata-derived labels, supervised benchmark, and threshold-sweep analysis.

ML/NLP + LLM Solutioning

Combines keyword rules, TF-IDF/Logistic Regression, selective LLM classification, and human fallback.

Actionable Recommendation

Recommends a hybrid routing policy and confidence-threshold selection based on risk, review capacity, and LLM budget.

System Design

4-Stage Routing Cascade

Each stage handles the cases it's best suited for. Ambiguous tickets pass downstream; uncertain cases always reach a human.

Incoming Ticket
Rule-based Routing
Calibrated ML
LLM Fallback
Human Triage
Stage 1

Rule-based Routing

Deterministic pattern matching for obvious cases — instant, zero model overhead, highest reliability.

Stage 2

Calibrated ML

TF-IDF + Logistic Regression with isotonic calibration. Auto-routes when confidence ≥ high threshold.

Stage 3

LLM Fallback

An LLM classifies ambiguous, low-confidence tickets by issue type. It is not invoked for every ticket.

Stage 4

Human Triage

Mid-confidence ambiguity and LLM failures always route here. This is the safety net; the system never fails silently.

Not an LLM-only design. LLM calls are intentionally limited to low-confidence ambiguous cases to control API cost and improve operational safety. The majority of tickets are handled by rules or ML without any LLM invocation.
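The dispatch logic described above can be sketched in a few lines. Thresholds, rule patterns, and the `llm_classify` stub are illustrative assumptions; in the project they come from the threshold sweep and the actual rule set.

```python
import re

# Illustrative thresholds; real values come from the cost-coverage sweep.
HIGH, LOW = 0.80, 0.40

# Deterministic patterns for obvious cases (Stage 1). Patterns are made up.
RULES = {
    r"\brefund\b": "billing",
    r"\bpassword reset\b": "technical",
}

def llm_classify(text):
    """Stand-in for the selective LLM call (Stage 3); returns None to abstain."""
    return None

def route(text, ml_confidence, ml_label):
    """Return (stage, label) for one ticket."""
    # Stage 1: rules are instant and carry zero model overhead.
    for pattern, label in RULES.items():
        if re.search(pattern, text, re.IGNORECASE):
            return ("rules", label)
    # Stage 2: calibrated ML auto-routes only above the high threshold.
    if ml_confidence >= HIGH:
        return ("ml", ml_label)
    # Stage 3: the LLM is invoked only for low-confidence ambiguity.
    if ml_confidence < LOW:
        try:
            label = llm_classify(text)
            if label is not None:
                return ("llm", label)
        except Exception:
            pass  # an LLM failure falls through to human triage, never silent
    # Stage 4: everything else lands with a human.
    return ("human", None)
```

Note that the only path producing no label is the human queue, which is exactly the "never fail silently" guarantee.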

Interactive Demo

Try the routing logic

Paste or select a sample support ticket to see how the hybrid routing policy would classify and route it.

Sample tickets

Select a sample or type a ticket, then click "Route ticket"

Static front-end demo based on the project routing design. The full training, evaluation, and routing pipeline is available in the GitHub implementation.

Results

Results & Insights

Three views of the system: operational health, policy tradeoffs, and measured model quality. Metric types are clearly distinguished.

Proxy + Estimated Metrics

Operations Overview

Shows routing-stage distribution, human triage rate, LLM invocation rate, average confidence score, estimated cost per ticket, and queue distribution across all processed tickets.

Human triage rate is a routing-system proxy, not a downstream escalation rate. Cost estimates are analytic, not measured from live LLM calls.

Estimated Analytic Metrics

Cost–Coverage Policy Tradeoff

Shows how confidence thresholds shift tickets between auto-routing, LLM fallback, and human review. Thresholds are operating policy choices — not just model parameters — and must be selected with risk tolerance and LLM budget in mind.

Computed analytically from ML confidence scores. No live LLM calls required for this sweep.
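Because the sweep needs only stored ML confidences, it runs offline. A minimal sketch, with a synthetic confidence distribution and made-up unit costs (these are assumptions, not the project's measured figures):

```python
import numpy as np

rng = np.random.default_rng(0)
confidences = rng.beta(2, 2, size=1000)  # stand-in for stored ML confidences

# Illustrative unit costs per ticket; human review dominates.
COST = {"auto": 0.001, "llm": 0.02, "human": 1.0}

def sweep(confidences, low=0.40):
    """One cost-coverage point per candidate high threshold."""
    curve = []
    for high in np.linspace(0.50, 0.95, 10):
        auto = (confidences >= high).mean()   # auto-routed by ML
        llm = (confidences < low).mean()      # sent to LLM fallback
        human = 1.0 - auto - llm              # everything in between
        cost = (auto * COST["auto"] + llm * COST["llm"]
                + human * COST["human"])
        curve.append({"high": round(float(high), 2), "coverage": auto,
                      "est_cost_per_ticket": cost})
    return curve

curve = sweep(confidences)
# Raising the high threshold shrinks auto-route coverage and, because
# human review is the costliest queue, raises estimated cost per ticket.
```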

Measured on Holdout Data

Model Evaluation

Compares ML against the keyword baseline on 1,665 held-out, metadata-derived tickets. The improved ML model raises Macro-F1 from 12.1% to 33.2% over keyword rules. The ML baseline's 59.4% accuracy reflects majority-class bias; the improved model trades raw accuracy for better class balance. Limitations are stated transparently.

Labels are metadata-derived (Ticket Type field), not human-reviewed. Absolute scores remain modest and should be interpreted cautiously.
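The accuracy-versus-macro-F1 gap noted above is easy to reproduce on toy data: a "model" that only ever predicts the majority class can look reasonably accurate while its macro-F1 collapses. The class mix below is invented for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# 60% of tickets belong to one class; a predictor that always answers
# the majority class looks decently accurate but has poor macro-F1.
y_true = ["billing"] * 60 + ["technical"] * 25 + ["account"] * 15
y_pred = ["billing"] * 100  # majority-class-only "model"

acc = accuracy_score(y_true, y_pred)                            # 0.60
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
# billing F1 = 0.75, technical F1 = 0, account F1 = 0 -> macro = 0.25
```

This is why the benchmark reports macro-F1 alongside accuracy.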

Business Recommendation

Use a hybrid routing policy

rather than LLM-only or rules-only routing.

Keep deterministic rules for high-signal obvious cases — no model overhead, highest reliability.

Use calibrated ML for high-confidence routing — cost-efficient, fast, and measurable.

Trigger LLM classification only for ambiguous low-confidence tickets — not for every case.

Route uncertain cases and LLM failures to human triage — never fail silently.

Select confidence thresholds based on risk tolerance, review capacity, and LLM budget.

Decision Memo

Executive Decision Memo

What I would recommend to a support operations leader.

Recommendation

Use a hybrid routing policy rather than LLM-only routing.

Rationale

Rules handle obvious cases cheaply; calibrated ML handles high-confidence routing; LLM fallback is reserved for ambiguity; human triage protects quality.

Operating policy

Choose confidence thresholds based on routing risk, agent review capacity, and LLM budget.

Next validation

Build a human-reviewed eval set, add queue-level SLA requirements, and monitor calibration/drift before production rollout.

Stakeholder View

Stakeholder View

How the analysis translates for different partners in a support operations environment.

Support Ops

Use routing-stage and queue metrics to understand review load and staffing pressure.

Data Science

Evaluate ML against keyword baselines with measured accuracy, macro-F1, weighted-F1, and per-class metrics.

Engineering

Implement configurable thresholds, safe LLM fallback, and human-triage failure handling.

Leadership

Use cost–coverage tradeoffs to choose an operating policy aligned with risk tolerance and LLM budget.

Honest Assessment

Limitations & Next Steps

Transparent caveats are a mark of rigorous data science. Here is what this system does and does not claim.

Current Limitations

  • Metadata-derived labels are not human-reviewed production gold labels
  • Kaggle ticket descriptions are templated and may not fully reflect production support traffic
  • The supervised benchmark reliably covers only 3 of the 6 routing classes present in the raw Kaggle metadata
  • Absolute ML scores are modest; interpret cautiously against real-world benchmarks

Next Steps

  • Build a 500–2,000 row human-reviewed eval set for more reliable per-class F1 signal
  • Add queue-level SLA policies and risk-tier threshold settings per support category
  • Track true downstream escalation and resolution outcomes beyond routing decisions
  • Add calibration plots, reliability monitoring, and model drift detection
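The calibration-monitoring step above could start from a simple expected calibration error (ECE) check on logged predictions. This is a sketch of the standard binned-ECE computation, not code from the project:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average gap between confidence and accuracy, weighted per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # bin weight x calibration gap
    return ece

# A stream of 0.5-confidence predictions that are right half the time
# is perfectly calibrated, so its ECE is (near) zero.
ece = expected_calibration_error([0.5] * 100, [1, 0] * 50)
```

Tracking this metric over time, per queue, would surface the drift the next steps call out before it degrades routing quality.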
Implementation

Tech Stack

Built entirely with open-source tools on public Kaggle data. No proprietary data sources.

Data & ML
Python
pandas
scikit-learn
Logistic Regression
TF-IDF
NLP
Tokenization
Char n-grams
Text normalization
FeatureUnion
AI & LLM
Selective LLM fallback
Optional OpenAI API
JSON parsing
Data Sources
Kaggle
Twitter support
Ticket dataset
Metadata labels
App & Viz
Streamlit
Matplotlib
seaborn
Next.js
Tailwind