15 Hours, One Rejection: A Support Operations Architecture

I wrote this as a job application case study. The company rejected me before we ever spoke.

Fifteen hours of work. No conversation. An email rejection.

Rather than letting those 15 hours go to waste, I'm publishing the case study here. The thinking is sound, and it shows what it looks like to walk into a broken support operation, figure out what's actually wrong, and design a fix. That process is the same whether the scenario is real or hypothetical.

The headline numbers:

  • $571K–$734K: annualized avoided hiring costs
  • 11% → <8%: projected agent attrition reduction
  • <5 min: enterprise first response time target
  • 133%: enterprise churn increase (baseline)
  • 80-120/wk: retried import tickets automated
  • $10/mo: total AWS cost for 8 bridge services

The Situation

The scenario was the company's; the diagnosis, the architecture, and the decisions below are mine. Here is what those 15 hours of work uncovered.

The numbers hit first:

  • Enterprise churn: 1.8% to 4.2% in two quarters — a 133% increase
  • Agent attrition: 11% per month, roughly five departures every month
  • Utilization: already above the 70-75% industry threshold
  • Enterprise clients sharing a queue with free-trial users
  • Email sitting behind every other channel

The company: a B2B SaaS business that scaled from 50 enterprise accounts to 300 in a year, with 500 more planned in the next six months. The support team grew from 8 to 45 agents to keep pace, but nobody has owned the system architecture since the original ops manager left nine months ago.

And there's no safety net. The 45-agent team is staffed 8am-6pm EST with no formal Tier 2 rotation and no after-hours coverage. Complex issues that arrive after 6pm sit in queue until the next morning. That gap shows up in the data as overnight silence on enterprise tickets, cancellation requests, and CSAT scores of 1.

The symptom list is long. But symptoms aren't the diagnosis.


The Constraints

This wasn't an open-ended redesign. The case study came with hard boundaries, and those boundaries shaped every decision I made.

  • Headcount: Up to 2 hires within 60 days, 3-week ramp assumed.
  • Budget: $2,500/month beyond existing Zendesk and Zapier costs.
  • Timeline: Measurable improvement in enterprise CSAT and first response time within 60 days.
  • API access: Zendesk Suite Professional, Salesforce (read-only), admin portal (read-only), Slack API.
  • Authority: Decision-making authority over hiring, tooling, architecture, and process only. Product roadmap, org structure, and engineering priorities are out of scope.
  • Engineering: No dedicated internal tools team. AWS access (Lambda, API Gateway, EventBridge, S3) available for lightweight self-built services. Core product changes require 6-8 week lead time and executive sponsorship.

I spent time with these constraints before touching the architecture. A $2,500/month ceiling immediately rules out a Zendesk Enterprise upgrade. Read-only Salesforce access means no write-back automation. A 60-day clock means you can't sequence everything neatly. And two hires with a three-week ramp means the team plan has to be tight. These aren't minor footnotes. They're the reason I chose bridge services over platform upgrades, and why the at-risk account interventions had to run in parallel with the technical build.


The Diagnosis

Before building anything, I needed to understand what was actually broken. The case study included a dataset of 30 recent tickets with timestamps, channels, categories, response times, CSAT scores, and agent notes. I went through every ticket looking for the root cause underneath the surface-level chaos.

Everything in the data collapsed into four distinct problems.

1. Support generates no structured data

Tickets are untagged, uncategorized, and trapped in a single undifferentiated queue. Product can't see that one specific import error accounts for 43% of all tickets, so the bug stays open and agents absorb it as manual headcount cost. A major enterprise client threatened to switch to a competitor after submitting three import complaints in two days. The pattern was visible in the ticket data. The system couldn't route it for action.

When your support operation doesn't produce structured data, every team downstream is flying blind. You can't route what you haven't categorized. You can't report on what you haven't tagged. And you definitely can't train AI on it.

2. There is no onboarding process

Enterprise SSO is configured during onboarding with no certificate expiration tracking and no validation checklist. The fallout is entirely predictable:

  • One client (250 seats) got locked out during a client presentation.
  • Another client (300 seats) lost 187 minutes to an after-hours SSO outage.
  • A third (500 seats) gave a 1-out-of-5 support rating and called it "the most disorganized onboarding we've ever experienced."

These aren't agent failures. They're system failures. A runbook would have prevented every one of them.

3. There is no triage or enforcement system

Agents choose what to work on. High-priority tickets average 28 minutes to first response. Low-priority tickets average 12 minutes. The priority field enforces nothing. Agents gravitate toward the quick wins, which means the hard problems wait.

The result? A 175-seat enterprise client submitted a billing inquiry at 10:30pm. It sat overnight unanswered.

They submitted a cancellation notice 18 hours later.

Here's what told me this was a systems problem, not a people problem: the organization had already run a two-week manual triage pilot. A team lead pulled eight agents into a dedicated enterprise queue and manually triaged tickets. First response time dropped from 11 minutes to 3 minutes. CSAT jumped from 68% to 88%. But general queue response time increased by 5 minutes, and the team lead spent 2.5 hours a day on manual triage and couldn't sustain it.

The pilot proved the team can deliver enterprise-grade service when the system supports it. My job was to make the system do what the team lead was doing manually.

4. Information is fragmented across too many tools

Agents switch between five or six tools before they can begin resolving a ticket. That's 5-6 minutes of overhead on every interaction. One ticket required three separate system lookups to find a refund policy that may have been outdated. Another required a two-hour engineering wait due to outdated internal docs. A pricing question required Salesforce, a Google Sheet, and a Slack confirmation.

This fragmentation is the primary driver of the 11% attrition. Agents aren't leaving because the work is hard. They're leaving because the work is unnecessarily frustrating.

At roughly $13,000 per departure, that's approximately $780,000 in annual attrition cost.


The Architecture

The instinct in most organizations is to buy something new. I went the other direction. I looked at what was already in the toolbox and unused.

The company had Zendesk Suite Professional. That includes SLA policies, skills-based routing, ticket forms, Explore analytics, AI agents, Knowledge Builder, App Builder, and CSAT surveys. None of it was configured. This is the equivalent of buying a car with all-wheel drive, navigation, and lane assist, then only using it to drive to the mailbox in first gear.

The sequencing logic mattered as much as the decisions themselves. I needed each step to create the conditions for the next: activate what's already paid for, bridge what the platform can't do natively, then build what needs to be permanent. Every extension gets a decommission date so nothing becomes accidental infrastructure.

Phase 1, Weeks 1-3: Activate, Configure, Bridge

The first three days aren't spent on configuration. They're spent in the queue, watching agents work, talking to the people who have been diagnosing this system for months. The ticket data tells you what's broken. The agents tell you why, and what they've already tried. Their observations shape every decision that follows.

In parallel, three at-risk enterprise accounts require immediate direct outreach. These clients can't wait for infrastructure. One had submitted a cancellation request citing support responsiveness. Another had escalated to our CEO. A third had given a 1-out-of-5 CSAT rating. Each gets a named account owner, a specific recovery offer, and a follow-up timeline. The account save plays and the technical build run simultaneously because the 60-day clock doesn't give you the luxury of sequencing them.

The platform configuration is sequenced so each step creates the conditions for the next:

Structured ticket forms. Before anything else, intake needs to produce structured data. I'm thinking about this as the foundation layer. Forms capture category, issue type, and priority at submission, which makes everything downstream (routing, reporting, AI deflection) possible. You can't analyze what you can't collect.
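As a sketch of what that foundation looks like in practice, here's how the category field and form could be created through Zendesk's Ticket Fields and Ticket Forms APIs. The subdomain, credentials, and category taxonomy are illustrative assumptions; the real option list would come out of the ticket audit.

```python
import requests

# Illustrative values throughout: subdomain, credentials, and the
# category taxonomy would all come from the real environment.
BASE = "https://example-co.zendesk.com/api/v2"
AUTH = ("ops@example-co.com/token", "ZENDESK_API_TOKEN")

# A dropdown ("tagger") field: every option stamps a tag on the ticket,
# which is what makes routing, reporting, and deflection possible later.
field = requests.post(f"{BASE}/ticket_fields.json", auth=AUTH, json={
    "ticket_field": {
        "type": "tagger",
        "title": "Issue category",
        "required_in_portal": True,   # end users must categorize at intake
        "custom_field_options": [
            {"name": "Data import", "value": "cat_import"},
            {"name": "SSO / login", "value": "cat_sso"},
            {"name": "Billing",     "value": "cat_billing"},
            {"name": "Other",       "value": "cat_other"},
        ],
    },
}).json()["ticket_field"]

# A form that puts the field in front of end users at submission.
requests.post(f"{BASE}/ticket_forms.json", auth=AUTH, json={
    "ticket_form": {
        "name": "Support request",
        "end_user_visible": True,
        "ticket_field_ids": [field["id"]],
    },
})
```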

SLA policies and skills-based routing. Enterprise tickets auto-tag via Salesforce sync and route to a dedicated enterprise agent group. SLA enforcement applies equally to email and chat, closing the channel gap that lets email sit behind everything else regardless of account tier or urgency. This is where the pilot results become permanent. The manual triage that proved a 68%-to-88% CSAT improvement gets automated, without degrading the general queue.
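A hedged sketch of the SLA side, using Zendesk's SLA Policies API. The "enterprise" tag is assumed to be the one the Salesforce sync applies, and the per-priority targets are starting points, not tuned values:

```python
import requests

BASE = "https://example-co.zendesk.com/api/v2"   # placeholders as before
AUTH = ("ops@example-co.com/token", "ZENDESK_API_TOKEN")

# Enterprise tickets (tagged by the Salesforce sync) get a first-reply
# SLA measured around the clock, on every channel including email.
requests.post(f"{BASE}/slas/policies.json", auth=AUTH, json={
    "sla_policy": {
        "title": "Enterprise first response",
        "filter": {
            "all": [{"field": "tags", "operator": "includes", "value": "enterprise"}],
            "any": [],
        },
        "policy_metrics": [
            # Targets are in minutes; business_hours False = 24/7 clock.
            {"priority": "urgent", "metric": "first_reply_time", "target": 5,  "business_hours": False},
            {"priority": "high",   "metric": "first_reply_time", "target": 5,  "business_hours": False},
            {"priority": "normal", "metric": "first_reply_time", "target": 15, "business_hours": False},
            {"priority": "low",    "metric": "first_reply_time", "target": 30, "business_hours": False},
        ],
    },
})
```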

AI agent deflection. Suite Professional includes AI agents out of the box. One support agent had already written the workaround macro for the most common import error. I wire that to the AI agent and handle 80-120 retry tickets per week automatically, recovering 15-20 agent-hours weekly.
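The AI agent itself is configured in the admin UI, but the same deflection can be expressed as plain API calls. This deterministic stand-in (not the AI agent configuration) finds new tickets the intake form tagged as import errors and applies the agent-written workaround macro; the macro ID, tag, and sweep cadence are assumptions:

```python
import requests

BASE = "https://example-co.zendesk.com/api/v2"   # placeholders as before
AUTH = ("ops@example-co.com/token", "ZENDESK_API_TOKEN")
WORKAROUND_MACRO_ID = 123456   # hypothetical: the agent-written workaround macro

# Find new tickets the intake form tagged as import errors.
hits = requests.get(f"{BASE}/search.json", auth=AUTH,
                    params={"query": "type:ticket status:new tags:cat_import"}).json()

for ticket in hits["results"]:
    # Preview the ticket as the macro would leave it, then commit the
    # result: the workaround posts as a reply and fields update in one PUT.
    preview = requests.get(
        f"{BASE}/tickets/{ticket['id']}/macros/{WORKAROUND_MACRO_ID}/apply.json",
        auth=AUTH,
    ).json()
    requests.put(f"{BASE}/tickets/{ticket['id']}.json", auth=AUTH,
                 json={"ticket": preview["result"]["ticket"]})
```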

CSAT reconfiguration. Kill the third-party CSAT integration running at $600/month. Replace it with Zendesk's native CSAT surveys and a trigger that alerts only on scores below 2. Same signal, near-zero cost.
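A sketch of that trigger via the Triggers API. One assumption to flag: Zendesk's classic CSAT rates tickets good/bad rather than 1-5, so "scores below 2" maps to the "bad" condition here, and the webhook ID pointing at Slack is hypothetical:

```python
import requests

BASE = "https://example-co.zendesk.com/api/v2"   # placeholders as before
AUTH = ("ops@example-co.com/token", "ZENDESK_API_TOKEN")
SLACK_WEBHOOK_ID = "01HYPOTHETICAL"   # a Zendesk webhook pointed at Slack

# Fire only on negative ratings; good scores never touch Slack.
requests.post(f"{BASE}/triggers.json", auth=AUTH, json={
    "trigger": {
        "title": "Alert on negative CSAT",
        "conditions": {
            "all": [{"field": "satisfaction_score", "operator": "is", "value": "bad"}],
            "any": [],
        },
        "actions": [{
            "field": "notification_webhook",
            "value": [SLACK_WEBHOOK_ID,
                      '{"text": "Negative CSAT on ticket {{ticket.id}}: {{ticket.title}}"}'],
        }],
    },
})
```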

Layout builder. Agents get a streamlined view with enterprise context surfaced inside the ticket interface. Four tool tabs become one pane. This directly attacks root cause #4 (fragmentation) and reduces the 5-6 minutes of overhead per ticket.

Kill the Zapier automations. Three were running: a Slack notification on every new ticket (channel muted by most of the team), a Google Sheet append on every close (200K+ rows, loading slowly), and a CSAT-to-Slack relay (hitting rate limits twice that month). These look like automation. They aren't. All three get replaced by native Zendesk functionality that produces structured data instead of Slack messages and spreadsheet rows.

AWS Bridge Services

Suite Professional doesn't cover everything. After-hours routing, Salesforce enrichment, Stripe webhooks, certificate monitoring: these gaps are real. But the $2,500 budget rules out a platform upgrade, and these are problems I can solve with targeted code.

I build eight lightweight Lambda functions behind API Gateway. Each one solves a specific gap. Each one has a planned decommission date tied to the native feature or organizational decision that will eventually replace it.

Each bridge, the gap it closes, and its exit condition:

  • Intelligent Escalation Router: enriches alerts with Salesforce context and follows up at 30 and 60 minutes. Retires when native escalation SLAs land in Phase 2.
  • After-Hours Auto-Response: replaces overnight silence for enterprise tickets (shown end to end after the cost note below). Retires when the on-call rotation decision is made.
  • SSO Health Monitor: polls for SSO health, auto-creates tickets, and tracks certificate expiry. Retires after the engineering SSO evaluation.
  • Stripe Billing Monitor: auto-voids duplicate invoices and creates enriched billing tickets. Retires when native billing automation covers it.
  • Agent Context API: one click replaces four system lookups. Retires when the App Builder sidebar ships in Phase 2.
  • Enterprise Import Quality Monitor: tracks import queue status and auto-creates pre-enriched tickets. Retires when product engineering fixes the import bug.
  • Client-at-Risk Scoring: daily risk scores from ticket trends and renewal proximity (sketched just below). Retires when Explore dashboards take over in Phase 3.
  • Problem Management Report: weekly aggregation by category and resolution pattern. Retires when Explore dashboards take over in Phase 3.
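To make "risk scores from ticket trends and renewal proximity" concrete, here is a minimal sketch of the kind of scoring the Client-at-Risk Lambda runs daily. The signals mirror the diagnosis; the weights and thresholds are illustrative assumptions, not tuned values.

```python
from datetime import date

def risk_score(tickets_last_30d: int, tickets_prior_30d: int,
               bad_csat_last_30d: int, renewal: date, today: date) -> int:
    """Hypothetical 0-100 account risk score. The signals mirror the
    diagnosis; the weights are illustrative, not tuned."""
    score = 0
    # Rising ticket volume is the earliest warning sign in the dataset.
    if tickets_prior_30d and tickets_last_30d > 1.5 * tickets_prior_30d:
        score += 40
    # Recent negative ratings, capped so one angry week doesn't max out.
    score += min(bad_csat_last_30d * 15, 30)
    # Risk amplifies inside the 90-day renewal window.
    if 0 <= (renewal - today).days <= 90:
        score += 30
    return min(score, 100)

# Volume doubled, one bad rating, renewal in 45 days: flag for CS outreach.
print(risk_score(12, 5, 1, date(2026, 1, 15), date(2025, 12, 1)))  # -> 85
```

Anything crossing the alert threshold lands in the CS team's daily queue; when Explore dashboards take over in Phase 3, this function retires with the rest of the bridge.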

Total AWS cost for all eight: under $10/month. Lambda free tier covers the volume. Compare that to the $3,600/month Zendesk Enterprise upgrade that would solve the same problems with native features.
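To show the scale these services operate at, here is one of the eight sketched end to end: the After-Hours Auto-Response. Assume API Gateway relays a Zendesk trigger-webhook carrying the ticket ID; the subdomain, environment variable, and message copy are placeholders, and the fixed UTC-5 offset ignores daylight saving to keep the sketch short.

```python
import json
import os
import urllib.request
from datetime import datetime, timezone, timedelta

# Placeholders throughout; fixed UTC-5 offset is a simplification.
ZENDESK = "https://example-co.zendesk.com/api/v2"
AUTH_HEADER = "Basic " + os.environ["ZENDESK_BASIC_AUTH"]  # pre-encoded
EST = timezone(timedelta(hours=-5))
BUSINESS_START, BUSINESS_END = 8, 18   # the team's 8am-6pm coverage window

def lambda_handler(event, context):
    # API Gateway relays a Zendesk trigger-webhook carrying the ticket ID.
    ticket_id = json.loads(event["body"])["ticket_id"]
    if BUSINESS_START <= datetime.now(EST).hour < BUSINESS_END:
        return {"statusCode": 200, "body": "business hours, no-op"}

    # After hours: acknowledge immediately instead of overnight silence.
    payload = json.dumps({"ticket": {"comment": {
        "public": True,
        "body": ("Thanks for reaching out. The support team is offline until "
                 "8am EST; your ticket is flagged for first review in the "
                 "morning queue."),
    }}}).encode()
    req = urllib.request.Request(
        f"{ZENDESK}/tickets/{ticket_id}.json", data=payload, method="PUT",
        headers={"Content-Type": "application/json", "Authorization": AUTH_HEADER})
    urllib.request.urlopen(req)
    return {"statusCode": 200, "body": "auto-response sent"}
```

Thirty-odd lines, no servers, and the overnight-silence failure mode from the diagnosis disappears until the organization makes a real on-call decision.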

Phase 2, Weeks 3-5: Build and Integrate

Phase 1 is where the hard decisions live. Phases 2 and 3 are execution against the plan. The thinking is done, the sequencing is locked, and what follows is replacing the bridges with permanent infrastructure — and decommissioning what was never meant to stay.

The sidebar app (built via App Builder) replaces the Agent Context API bridge. One pane embedded in the agent workspace, pulling Salesforce and admin portal data without leaving the ticket. When it ships, the Lambda decommissions. This is the pattern repeating: bridge, validate, replace.

Reporting migrates to Explore. The 200K-row Google Sheet retires. Knowledge Builder launches, generating KB articles from accumulated ticket data (starting with the known issues that currently have no documentation) and migrating policy docs from the Sheet with a 30-day review cycle.

Phase 3, Weeks 5-8: Scale and Harden

Operational dashboards. Weekly problem review cadence. Billing workflow standardization. Proactive SSO evaluation. Enterprise onboarding playbook migration to a client-facing self-service portal.

Onboarding comes last because it depends on everything before it. You can't build a self-service portal until the knowledge base exists, and the knowledge base can't generate useful content until the ticket forms have been collecting structured data for weeks. The client who called their onboarding "the most disorganized we've ever experienced" represents every client who didn't complain. Phase 3 builds the system that prevents the next one.


What I Would Not Build

The discipline of restraint matters as much as the build plan. In a 60-day window with a $2,500/month budget, every decision to build something is also a decision not to build three other things. Here's where I deliberately said no, and why.

No Zendesk Enterprise upgrade. The incremental cost ($80/agent × 45 agents = $3,600/month) exceeds the $2,500 tooling budget. The Enterprise-only features that would justify it (after-hours routing, queue discipline, contextual agent views) are exactly what the bridge services solve at near-zero cost. The $2,500 stays in reserve for unanticipated needs during the 60-day window.

No Zendesk Copilot add-on. No labeled training data exists yet. The ticket forms I'm building in Phase 1 solve categorization deterministically. Copilot is the right move after 60+ days of categorized data, when there's something to train on. The AI agents already included in Suite Professional handle customer-facing deflection immediately without needing historical training data.

No permanent custom infrastructure. The previous ops manager built custom workarounds with no plan to sunset them. That's how you end up with accidental infrastructure nobody can maintain. Every Lambda bridge has a decommission date. I build bridges that self-destruct when the platform catches up.


The Team

The case study allowed two hires. The obvious roster would be a Data Analyst and a Zendesk Admin. I went a different direction because those obvious choices assume the infrastructure is already built. It isn't.

I thought about this as: what needs to happen in the first 60 days, and who can actually do that work? Three categories: platform architecture (Zendesk configuration, Salesforce sync, App Builder), process layer (agent training, enterprise workflows, problem review cadence), and bridge engineering plus cross-functional alignment (Lambda builds, product escalation, CS coordination). That third category is my role. So the question becomes: which two people cover the other two?

Me. Strategy, roadmap, cross-functional alignment, and the AWS bridge services. I own the eight Lambda builds, the product escalation to get engineering eyes on the import bug, the CS alignment to coordinate at-risk account recovery, and the direct account interventions. I'm also the PM for the overall program.

Solutions Architect. The Zendesk Professional platform: routing, SLAs, ticket forms, App Builder sidebar, Salesforce sync, Knowledge Builder, and Explore dashboards. I chose this role over a Zendesk Admin because this isn't ticket form setup. This person needs to understand API integrations well enough to build the App Builder sidebar, maintain the Lambda bridges during transition, and eventually decommission them as native features replace them. That's platform strategy.

Operations Lead. The process layer and the support team interface: managing the client-at-risk response workflow, running the weekly problem review from automated reports, training agents on new tools, and owning the enterprise onboarding checklist. This role exists because the best infrastructure in the world fails if the team doesn't adopt it. Someone needs to be in the room with agents every day, translating the architecture changes into new habits.

Roles I passed on and why

A Data Analyst. This sounds logical until you look at the current state: no tagged tickets, no categorized data, no structured reporting. You can't analyze what you can't collect. The collection layer doesn't exist yet, and that's what Phase 1 builds. Hire for analysis after 60+ days of structured data, when there's something meaningful to work with. Hiring an analyst now means paying someone to wait.

A Developer. Eight Lambda functions sounds like a dev hire. But these aren't complex applications. They're targeted glue code: API calls, data enrichment, scheduled triggers. Each one is 50-200 lines with a planned expiration date. The Solutions Architect maintains them. A dedicated developer would be underutilized within weeks, and their instinct would be to build more rather than decommission what's already working.


The Measurement

Each target below maps directly to one of the four root causes from the diagnosis. These are projected improvements, not guarantees. I based them on the data available in the case study (particularly the pilot results) and typical patterns I've seen in similar support operations.

  • Enterprise first response time: Median below 5 minutes. The two-week triage pilot showed a drop from 11 minutes to 3 minutes with dedicated enterprise routing. Automating that routing through SLA policies and skills-based assignment should replicate or exceed those results without the unsustainable manual overhead.
  • Enterprise CSAT: Above 80%. The same pilot pushed CSAT from 68% to 88%. I'd expect the automated version to land in that range, with further improvement as the knowledge base and onboarding playbooks mature.
  • General queue FRT: Stable or improved. The pilot degraded general queue response by 5 minutes because it pulled 8 agents from the main pool. Skills-based routing avoids that tradeoff by dynamically assigning agents.
  • AI deflection rate: 10-15% of chat/email volume resolved without human intervention, based on the import retry error pattern alone (80-120 tickets/week).
  • Agent attrition: Trending below 8%/month from 11%, driven by AHT reduction and fewer repetitive retry-error tickets.

The Growth Math

The company plans to add 500 enterprise accounts over 6 months. The baseline reality is stark:

  • Current capacity: 45 agents at 78% utilization (already above industry targets).
  • Current load: ~27 tickets per agent per day at a 14-minute average handle time.
  • Status quo projection: Adding 500 accounts means 600-800 new daily tickets, requiring 22-30 net-new agents just to tread water.

The architecture changes that math entirely:

  • The App Builder sidebar eliminates 5-6 minutes of tool switching per ticket, dropping AHT to ~9 minutes.
  • AI handles the repetitive retry errors, driving a 10-15% deflection rate on chat and email.
  • New projection: The same volume now requires only 8-12 additional agents.

That's 14-18 hires avoided.

At $3,400/month fully loaded per agent, that is $571K-$734K annualized in avoided hiring costs.

(Note: This is a projection, not a promise. It assumes new enterprise accounts generate tickets at current rates, before onboarding improvements reduce per-account ticket volume. The actual number depends on how effectively the infrastructure scales.)
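The arithmetic is simple enough to check directly from the stated inputs:

```python
tickets_per_agent_day = 27        # current load per agent
new_daily_tickets = (600, 800)    # projected from 500 new accounts

# Status quo: no AHT or deflection gains.
status_quo = [round(t / tickets_per_agent_day) for t in new_daily_tickets]
print(status_quo)                 # -> [22, 30] net-new agents

# With the architecture, the projection is 8-12 hires instead.
with_arch = (8, 12)
avoided = (status_quo[0] - with_arch[0], status_quo[1] - with_arch[1])
print(avoided)                    # -> (14, 18) hires avoided

# Annualized at $3,400/month fully loaded per agent.
print([n * 3400 * 12 for n in avoided])   # -> [571200, 734400]
```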


The Broader Pattern

I've worked in managed service providers for most of my career. Every one of them had some version of this problem. The details change. The root cause doesn't.

Bad data. Every time.

I've watched companies throw headcount at it. Hire six more agents, then wonder why response times didn't improve. I've watched them build custom tooling on top of broken processes, automating the wrong workflows faster. I've watched them deploy AI on unstructured ticket data and get confidently wrong suggestions that agents learned to ignore within a week.

None of it worked until the data layer was fixed. You can't route what you haven't categorized. You can't report on what you haven't tagged. You can't train a model on garbage and expect intelligence.

Fix the data collection layer first. Build temporary infrastructure with an exit plan from day one: bridges that self-destruct when the platform catches up. Keep the permanent build list shorter than you think it needs to be.

And spend the first three days in the queue. Not building. Watching.

The team that's been absorbing this dysfunction knows exactly what's broken. Listening to them is the most important infrastructure decision you can make.


FAQ

How do you prioritize when everything feels broken at once?

Map all issues to root causes. This company had over a dozen visible symptoms, but they collapsed into four root causes. Fixing those four resolves most of the symptoms automatically. Start with the root cause that generates the most downstream problems. In this case, that was the absence of structured data, since everything else (routing, reporting, AI deflection) depended on it.

What would you do differently with a bigger budget?

Skip the bridges. A Zendesk Enterprise upgrade at $3,600/month gives you native after-hours routing, advanced queue management, and contextual workspaces without writing a line of Lambda code. I'd also move faster on Zendesk Copilot, since Enterprise includes the labeled data infrastructure that makes Copilot useful instead of decorative. The architecture stays the same. The implementation gets faster because you're not building temporary workarounds.

What would you do if the first 30 days didn't produce results?

Check the data layer before changing anything else. The structured ticket forms and SLA routing are the foundation of this entire architecture. If those aren't generating clean, categorized data, nothing downstream works — not the routing, not the AI deflection, not the reporting. That's the first diagnostic question.

If the forms are generating clean data but metrics aren't moving, I'd go back to the queue. Not to pull reports — to sit with agents and watch them work. The triage pilot showed what's possible when the system supports the team. If the automated version isn't replicating those results, the most likely explanation is incomplete adoption, not a broken architecture.

The one thing I wouldn't do is hire. Headcount is a lever of last resort when the system is still broken. Adding agents to a broken system makes it harder to fix — and masks the real problem long enough for it to become permanent.