Inside the Firewall: A Working Architecture for Private AI Workflows in Confidentiality-Bound Firms

By Principal Consultant · May 18, 2026 · 16 min read

Part of: Agentic Workflows

AI summary

A working architecture for private AI workflows in confidentiality-bound firms — law, accounting, wealth management, healthcare administration, family offices — where client data cannot be sent through public APIs. Develops a three-tier private deployment architecture: Tier 1 workstation class ($3–8k), Tier 2 prosumer workstation ($8–15k), Tier 3 server class or private cloud ($30–100k or $1–5k/mo dedicated endpoint). Includes a model × task fitness matrix mapping ten task types to each model class with explicit fitness ratings, plus three representative workflow architectures — one per tier — covering small-model intake routing for a CPA practice, mid-model contract review for a commercial law firm, and large-model agentic tax-memo synthesis for a regional accounting firm. Includes three data visualizations and closes with a 90-day deployment pattern and a CTA.

Custody is not a feature you bolt on. It is the architecture you build.

For the firms whose business is structurally built on confidentiality — law practices, accounting firms, wealth managers, healthcare administrators, family offices — the public-API model of AI consumption is incompatible with the work. The choice is not between AI and no-AI. The choice is between a private workflow architecture that runs inside the firm's own perimeter and an unsanctioned shadow economy of personal-laptop chatbot use that is both happening anyway and uninsurable when it leaks. The economic argument for private AI was decisively settled in the eighteen months ending in early 2026. The regulatory argument was settled before that. What remains is the operating decision: stand the architecture up deliberately, or wait for an audit, a regulator, or a client to force it under time pressure.

The regulatory and professional-conduct surface has converged quickly. The published opinions from state bar associations on attorney use of generative AI uniformly require informed client consent before transmitting client confidences to a third-party AI vendor, and an increasing number explicitly bar the transmission absent an enforceable confidentiality protection that the public API does not provide. CPAs operate under analogous Circular 230 confidentiality obligations and state-board guidance. Healthcare-adjacent firms operate under HIPAA's protected-health-information restrictions. Financial advisors operate under Regulation S-P. The common thread across every regime is the same: the firm bears residual liability for any data leak, and the argument that the AI vendor's terms of service prohibit retention of customer data is not a defense that survives examination — because the firm cannot prove what was transmitted, what was retained against policy, or what was accessed by anyone other than its intended recipient. Public-API AI use in these firms is, increasingly, a compliance posture the firm cannot afford to maintain.

The architectural response is private AI — a workflow in which inference runs on hardware the firm controls, the model weights are open-weight (the major families: Llama, Qwen, Mistral, DeepSeek, Phi, Gemma), and no client data crosses an external boundary at any point in the pipeline. The economics of this architecture have shifted dramatically over the past two years. What required a two-hundred-thousand-dollar server cluster in 2024 runs on a five-thousand-dollar workstation in 2026. The frontier-open models that ship in 2026 are within five to fifteen points of closed-frontier performance on the tasks that matter for legal, accounting, and financial work — and the gap is closing every quarter. The capability bar that justifies the private-AI architecture has moved into the reach of the typical small-to-mid professional firm, and the operating economics now favor it across most workload profiles.

Capability score against a representative confidentiality-bound work benchmark — open-weight private deployments now sit within five to fifteen points of closed-frontier, and closed-frontier is not available for the work.

Illustrative · synthesis of public benchmark results and observed engagements across legal, accounting, and financial workflows

Three hardware tiers. A pragmatic deployment pattern at middle-market scale sorts into three tiers, each defined by the capability ceiling of the models it can host at acceptable latency. The tier choice is determined by the workload's complexity, not by the firm's headcount. A five-attorney boutique with a transactional practice may require Tier 3, while a thirty-attorney generalist firm may run effectively on Tier 1.

Tier 1: workstation class. Hardware: an Apple Mac mini with an M4 Pro chip and sixty-four gigabytes of unified memory, or an equivalent PC workstation with an Nvidia RTX 4090 or RTX 5090 (twenty-four and thirty-two gigabytes of VRAM, respectively). Upfront cost: three to eight thousand dollars. Models hosted at fast latency: Llama 3.2 3B, Phi-4 mini (3.8B), Qwen 2.5 7B, Gemma 2 9B, Mistral 7B. Capability: strong on classification, field extraction, structured-output tagging, document routing, and embedding generation. Limited on complex multi-step reasoning and nuanced drafting. Latency for a typical inference call: under one second. Power draw: forty to two hundred fifty watts. The tier sits behind the receptionist's desk or in the firm's IT closet, absorbing the high-volume, low-judgment work that consumes paralegal and administrative time.

Tier 2: prosumer workstation. Hardware: a Mac Studio M4 Max with one hundred twenty-eight gigabytes of unified memory, or a PC workstation with an Nvidia RTX A6000 (forty-eight gigabytes of VRAM) or dual RTX 5090s (sixty-four gigabytes combined). Upfront cost: eight to fifteen thousand dollars. Models hosted well: Qwen 2.5 32B, Mistral Small (24B), Gemma 27B, DeepSeek-R1-distill 32B, Llama 3.3 70B at four-bit quantization. Capability: solid on drafting routine correspondence, multi-document Q&A, summarization, retrieval-augmented generation over a curated firm corpus, and simple agentic orchestration of a small toolset. Limited only on the most nuanced complex-reasoning work where a closed frontier model still leads. Latency: one to three seconds per call. This is the tier where the bulk of partner-adjacent professional work becomes addressable, and a single Tier 2 workstation can carry the inference load of fifteen to thirty professional users in a typical week.

Tier 3: server class or private cloud. Hardware: a multi-GPU workstation (two RTX A6000s for ninety-six gigabytes of VRAM, or two H100s for one hundred sixty gigabytes), or a rented private endpoint from a HIPAA-compliant inference vendor running on dedicated hardware that the firm contracts for exclusively. Upfront cost: thirty to one hundred thousand dollars purchased, or one to five thousand dollars per month leased as private cloud. Models hosted at full quality: Llama 3.3 70B at full precision, Qwen 2.5 72B, the latest open-frontier releases, and, where the configuration has enough aggregate memory, DeepSeek-V3 (671B mixture-of-experts), whose weights alone exceed the memory of a two-GPU workstation. Capability: addresses the complex-reasoning work, including contract review with edge-case nuance, M&A diligence synthesis, multi-step agentic chains, complex tax memos with citation, and legal research with grounded citation chains. Latency: two to ten seconds per call for the deepest reasoning chains. This tier exists where the work itself is partner-grade and the volume is high enough to justify the capital outlay, typically a firm of forty or more professionals, or a smaller firm with an unusually high-complexity practice.
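The tier boundaries above follow mostly from memory arithmetic: a model's weights take roughly parameters × bits per weight ÷ 8 bytes. The sketch below is a rough sizing aid, not a benchmark. The function names are hypothetical, and the flat ten percent overhead for KV cache and runtime buffers is an assumption; real overhead grows with context length and batch size.

```python
# Rough sizing sketch. The 10% overhead factor is an assumption, not a measurement.

def weight_memory_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 0.10) -> float:
    """Approximate memory (GB) to host a model: weights plus a flat overhead."""
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits is about 1 GB
    return weights_gb * (1 + overhead)


def fits(params_billion: float, bits_per_weight: int, memory_gb: float,
         overhead: float = 0.10) -> bool:
    """True if the estimated footprint fits in the given memory budget."""
    return weight_memory_gb(params_billion, bits_per_weight, overhead) <= memory_gb


# Memory budgets taken from hardware named in the tier descriptions.
MEMORY_GB = {
    "RTX 4090": 24,
    "RTX A6000": 48,
    "Mac Studio 128 GB": 128,
    "2x H100": 160,
}

for name, params, bits in [("7B @ 4-bit", 7, 4), ("32B @ 4-bit", 32, 4),
                           ("70B @ 4-bit", 70, 4), ("70B @ 16-bit", 70, 16)]:
    hosts = [hw for hw, gb in MEMORY_GB.items() if fits(params, bits, gb)]
    print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB, fits on {hosts}")
```

On unified-memory machines only part of the installed memory is available to the GPU, so treat the Mac figures as an upper bound.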

The model-by-task fitness matrix. The mapping from model class to task fitness is the single most consequential design choice in any private AI deployment. The wrong choice — running a seventy-billion-parameter model where a seven-billion-parameter model would suffice, or vice versa — wastes capital in one direction and produces unreliable output in the other. The fitness map below names the recommended deployment for each task type across the three tiers; the bolded cell in each row is the deployment that maximizes the price-to-capability ratio for that task.

A few patterns are robust across deployments. The first: classification, extraction, and routing tasks scale down to the small models cleanly, and using a frontier model for them is structurally wasteful. The second: the inflection point where complex reasoning becomes reliable sits around the thirty-billion-parameter mark in the open-weight series. Below it, multi-hop synthesis is unstable; above it, the gap to closed-frontier closes quickly. The third: retrieval (the R in RAG) is handled at the embedding-model layer, where an embedding model of roughly one hundred million parameters produces vectors the rest of the workflow can ground against; the inference model only has to consume the retrieved chunks, not find them. Most deployments therefore mix model sizes inside a single workflow (a small embedding model, a small classifier, and a mid-or-large model for the consequential output) rather than picking one model for the entire pipeline.

Model × task fitness matrix — what each open-weight class is actually good at, with the recommended deployment tier in bold.

Illustrative · Sovereign Action analysis of open-weight model performance against representative confidentiality-bound workloads, 2026
| Task type | Small (3–9B) | Medium (14–32B) | Large (70B+) |
| --- | --- | --- | --- |
| Classification & routing | Strong | Strong (overkill) | Strong (overkill) |
| Structured field extraction | Strong | Strong | Strong (overkill) |
| Redaction & PII tagging | Strong | Strong | Strong |
| Single-document summarization | Adequate | Strong | Strong |
| Multi-document Q&A | Limited | Solid | Strong |
| Templated drafting | Adequate | Strong | Strong |
| Complex memo drafting | Inadequate | Solid | Strong |
| Multi-step reasoning | Inadequate | Adequate | Strong |
| Agentic tool-use chains | Limited | Solid | Strong |
| Citation-grounded research | Limited | Solid | Strong |
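The matrix can be carried into a workflow as a lookup table, so each step is routed to the smallest model class that clears a quality bar. This is a minimal sketch: the rating order is an assumption inferred from the labels, the function name is hypothetical, and "smallest sufficient class" is one reasonable selection rule that may not match the recommended cell in every row of the original figure.

```python
from typing import Optional

# Rating order is an assumption inferred from the labels; the source does not rank them.
RATING_ORDER = ["Inadequate", "Limited", "Adequate", "Solid", "Strong"]
CLASSES = ["small", "medium", "large"]  # 3-9B, 14-32B, 70B+

# Ratings copied from the matrix, in (small, medium, large) order.
FITNESS = {
    "Classification & routing": ("Strong", "Strong", "Strong"),
    "Structured field extraction": ("Strong", "Strong", "Strong"),
    "Redaction & PII tagging": ("Strong", "Strong", "Strong"),
    "Single-document summarization": ("Adequate", "Strong", "Strong"),
    "Multi-document Q&A": ("Limited", "Solid", "Strong"),
    "Templated drafting": ("Adequate", "Strong", "Strong"),
    "Complex memo drafting": ("Inadequate", "Solid", "Strong"),
    "Multi-step reasoning": ("Inadequate", "Adequate", "Strong"),
    "Agentic tool-use chains": ("Limited", "Solid", "Strong"),
    "Citation-grounded research": ("Limited", "Solid", "Strong"),
}


def smallest_sufficient_class(task: str, minimum: str = "Strong") -> Optional[str]:
    """Return the smallest model class rated at or above `minimum` for the task."""
    floor = RATING_ORDER.index(minimum)
    for model_class, rating in zip(CLASSES, FITNESS[task]):
        if RATING_ORDER.index(rating) >= floor:
            return model_class
    return None  # no class clears the bar


for task in FITNESS:
    print(f"{task}: {smallest_sufficient_class(task)}")
```

Lowering `minimum` to "Solid" shows where a mid-size model is acceptable if the output is always reviewed by a professional.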

Three workflow architectures, one per tier. The hardware tier and the model class together determine the architectural shape of the workflow. Three concrete patterns, one per tier, cover the bulk of confidentiality-bound deployments at middle-market scale.

Workflow A: small-model intake and routing (Tier 1). A representative ten-CPA accounting practice receives twenty to two hundred inbound documents per week during tax season: W-2s, 1099s, K-1s, brokerage statements, receipts, client correspondence, IRS notices. The historical workflow is manual paralegal triage: open each document, classify it, route it to the responsible CPA, and file it in the client folder. The private-AI version: an inbound watcher pulls each document through locally run OCR; a seven-billion-parameter classification model categorizes by document type and tags the responsible client (using a small local RAG index of client identifiers); a redaction layer masks anything outside the agreed processing scope; the document routes to the correct CPA's inbox with metadata pre-filled. The named maintainer reviews the routing queue daily. Hardware: a single Mac mini (M4 Pro) at approximately thirty-five hundred dollars on the office network. Eval suite: a golden set of two hundred historical documents with expert-validated routing decisions, plus a weekly production sample. The workflow absorbs roughly fifteen to twenty hours per week of senior paralegal time at full deployment.
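The shape of Workflow A (classify, match the client, redact, route, then score against a golden set) can be sketched in a few lines. Everything here is illustrative: `keyword_classifier` is a stand-in for the local classification model, the SSN pattern is only one example of a redaction rule, and the client index and assignment tables are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # one example redaction rule


@dataclass
class RoutedDocument:
    doc_type: str
    client_id: Optional[str]
    assignee: Optional[str]
    redacted_text: str


def redact(text: str) -> str:
    """Mask SSN-shaped strings before the document leaves the intake step."""
    return SSN_PATTERN.sub("[REDACTED-SSN]", text)


def keyword_classifier(text: str) -> str:
    """Stand-in for the local classification model (hypothetical rules)."""
    lowered = text.lower()
    for label, marker in (("W-2", "wage and tax statement"), ("1099", "form 1099"),
                          ("K-1", "schedule k-1"), ("IRS notice", "internal revenue service")):
        if marker in lowered:
            return label
    return "correspondence"


def route(text: str, classify: Callable[[str], str],
          client_index: Dict[str, str], assignments: Dict[str, str]) -> RoutedDocument:
    """Classify, match the client by name, redact, and pick the responsible CPA."""
    lowered = text.lower()
    client_id = next((cid for name, cid in client_index.items()
                      if name.lower() in lowered), None)
    assignee = assignments.get(client_id) if client_id else None
    return RoutedDocument(classify(text), client_id, assignee, redact(text))


def golden_set_accuracy(golden: List[Tuple[str, str]],
                        classify: Callable[[str], str]) -> float:
    """Share of golden-set documents whose predicted type matches the expert label."""
    return sum(classify(text) == expected for text, expected in golden) / len(golden)
```

Passing the classifier in as a callable keeps the eval harness unchanged when the stand-in is swapped for a real model call.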

Workflow B: mid-model drafting and RAG (Tier 2). A representative twenty-five-attorney commercial law firm runs a contract-review queue: NDAs, MSAs, vendor agreements, employment letters, lease amendments. The historical workflow is associate review against a partner-approved playbook, partner approval, redline drafting, and delivery to the client. The private-AI version: a contract intake step normalizes the document and extracts metadata; a thirty-two-billion-parameter model performs first-pass clause-by-clause comparison against the firm's playbook; deviations are flagged with severity scores and category labels; the model drafts proposed redlines; the case routes to the responsible attorney with the playbook citation chain visible. Every output writes to the firm's document management system with full eval-log provenance. Hardware: a single Mac Studio M4 Max at approximately ten thousand dollars running Qwen 2.5 32B or Llama 3.3 70B quantized to four bits. Eval suite: a golden set of one hundred expert-reviewed contracts per major category, plus a weekly production sample at five percent. The workflow compresses associate review from two to three hours per contract down to twenty to forty minutes of focused attorney review on what the system has surfaced.
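A minimal sketch of the flagging and sampling steps in Workflow B follows. The data shapes are hypothetical: `phrase_comparer` stands in for the model's clause comparison, the zero-to-one severity scale and the 0.3 threshold are assumptions, and hashing the document ID is just one way to draw a stable five percent sample.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PlaybookRule:
    ref: str                      # playbook citation shown to the attorney
    forbidden_phrases: List[str]
    severity: float               # 0.0 (cosmetic) to 1.0 (blocking); scale is an assumption


@dataclass
class Deviation:
    category: str
    severity: float
    playbook_ref: str
    clause_text: str


def phrase_comparer(clause_text: str, rule: PlaybookRule) -> float:
    """Stand-in for the model's clause comparison: returns a severity score."""
    lowered = clause_text.lower()
    return rule.severity if any(p in lowered for p in rule.forbidden_phrases) else 0.0


def review(clauses: Dict[str, str], playbook: Dict[str, PlaybookRule],
           compare: Callable[[str, PlaybookRule], float] = phrase_comparer,
           threshold: float = 0.3) -> List[Deviation]:
    """Flag clauses that deviate from the playbook, most severe first."""
    flagged = []
    for category, text in clauses.items():
        rule = playbook.get(category)
        if rule is None:
            continue  # no playbook position for this clause category
        severity = compare(text, rule)
        if severity >= threshold:
            flagged.append(Deviation(category, severity, rule.ref, text))
    return sorted(flagged, key=lambda d: d.severity, reverse=True)


def in_weekly_sample(document_id: str, rate: float = 0.05) -> bool:
    """Deterministically select about `rate` of documents for human eval review."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < int(rate * 2**64)
```

Because the sample is derived from the document ID, rerunning the job selects the same contracts, which keeps the eval log reproducible.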

Workflow C: large-model agentic synthesis (Tier 3). A representative regional accounting and advisory firm runs a complex tax-memo workflow during quarter-end: estate planning memos, K-1 reconciliations, multi-state apportionment analysis, regulatory-change impact assessments. The historical workflow is partner-level research with associate support, memo drafting, partner review, and client delivery. The private-AI version: an agentic chain in which a seventy-billion-parameter model orchestrates retrieval against the firm's tax research library (the major reference services plus the firm's own prior memos), runs multi-step reasoning over the citations, drafts the memo with full citation grounding, and routes to the responsible partner with a reviewer dashboard showing the reasoning chain, the citations consulted, and the eval score against the firm's golden set of historical memos. Hardware: a multi-GPU workstation with two A6000s at approximately twenty-five thousand dollars, or a rented private endpoint on dedicated hardware at approximately twenty-five hundred dollars per month. The workflow takes a memo that historically required ten to twenty professional hours and compresses partner time to two to three hours of focused review.
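The reviewer dashboard in Workflow C depends on one mechanical check: every citation in the draft must trace back to a source that was actually retrieved. The sketch below shows that check under stated assumptions. The bracketed `[source-id]` citation format, the `ReviewPacket` shape, and the `retrieve` and `draft` callables (stand-ins for the retrieval layer and the large model) are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

CITATION = re.compile(r"\[([A-Za-z0-9_.:-]+)\]")  # assumed format: [source-id]


@dataclass
class ReviewPacket:
    memo: str
    citations_consulted: List[str]
    ungrounded_citations: List[str]
    steps: List[str] = field(default_factory=list)

    @property
    def ready_for_partner(self) -> bool:
        """A draft with any citation outside the retrieved set is held back."""
        return not self.ungrounded_citations


def synthesize(question: str,
               retrieve: Callable[[str], Dict[str, str]],
               draft: Callable[[str, Dict[str, str]], str]) -> ReviewPacket:
    """Retrieve sources, draft the memo, then verify every citation is grounded."""
    steps = []
    sources = retrieve(question)          # source_id -> passage text
    steps.append(f"retrieved {len(sources)} sources")
    memo = draft(question, sources)
    steps.append("drafted memo")
    cited = CITATION.findall(memo)
    ungrounded = sorted({c for c in cited if c not in sources})
    steps.append(f"checked {len(cited)} citations")
    return ReviewPacket(memo, sorted(sources), ungrounded, steps)
```

The `steps` list is the raw material for the reasoning-chain view; a production chain would record far more detail per step.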

Break-even math. The financial comparison between private AI and public-API AI is not a single calculation; it is a curve. The variables are upfront hardware cost, monthly operating cost (electricity, software maintenance, the named maintainer's time), and the firm's monthly token volume. Below a certain volume threshold, the public API is cheaper. Above it, the private deployment pays back, and the payback period compresses as volume grows. At a representative mid-firm load of four hundred million tokens per month — roughly fifteen to twenty-five professionals in active use — the break-even points sit at approximately three months for Tier 1, six months for Tier 2, and thirty-one months for Tier 3. The thresholds shift down meaningfully if the firm is using a frontier-class closed model (Opus-class pricing) and shift up if the firm is on a discounted enterprise contract.
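The curve described above reduces to a simple comparison: cumulative private cost is hardware plus monthly operating cost times months, cumulative API cost is monthly API spend times months, and break-even is the first month the former is no greater than the latter. In closed form that is the ceiling of hardware cost divided by the monthly saving. The sketch below uses placeholder inputs, which are assumptions for illustration and not the figures behind the break-even points quoted in this article.

```python
from typing import Optional


def monthly_api_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Monthly public-API spend at a blended per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_month(hardware_cost: float, private_monthly_opex: float,
                     api_monthly_cost: float, horizon_months: int = 36) -> Optional[int]:
    """First month where cumulative private cost <= cumulative API cost, else None."""
    for month in range(1, horizon_months + 1):
        private_total = hardware_cost + private_monthly_opex * month
        api_total = api_monthly_cost * month
        if private_total <= api_total:
            return month
    return None  # never pays back inside the horizon


# Placeholder inputs (assumptions): 400M tokens/month at a blended $4 per million
# tokens, a $10,000 workstation, and $400/month for power and maintainer time.
api = monthly_api_cost(400_000_000, 4.0)
print(api, break_even_month(10_000, 400, api))
```

Rerunning with a higher token price or volume pulls the crossover earlier; if monthly operating cost meets or exceeds API spend, the function returns `None`.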

Cumulative 36-month cost at a representative mid-firm load (~400M tokens/month) — public-API spend vs. three private-AI tiers, with break-even crossovers visible.

Illustrative · synthesis of frontier API pricing, current consumer GPU and Mac Studio retail pricing, observed inference operating costs, 2026

But the financial calculation is only one of three the firm is making, and it is not the most consequential. The second calculation is compliance posture. The firm that ships private AI has a defensible answer to a regulator, an auditor, or a client asking how client data is processed; the firm that ships public-API AI has, under the guidance published to date, an exposed liability surface and an examination posture that depends on the vendor's word rather than the firm's architecture. The third calculation is demand-signal capture: the firm that runs its own infrastructure absorbs the operator demand it would otherwise hand to the shadow economy, with all of that economy's compliance exposure and demand-signal blindness. The financial break-even matters, but the firms that wait until pure cost parity is reached are the firms that arrive late to a structural advantage that their faster competitors will have built on two years of accumulated context.

Workload positioning — each representative confidentiality-bound workflow plotted on complexity × volume, with the optimal hardware tier emerging from the quadrant.

Illustrative · synthesis of observed deployments across professional-services, legal, accounting, and advisory firms, 2026

Building the deployment: the 90-day pattern. Weeks one through three: choose the tier. Audit the firm's actual workload across three buckets: classification volume, drafting volume, and complex-reasoning volume. Compare against the fitness matrix above. Pick the tier that addresses the largest aggregate hours; over-provision by twenty-five percent for headroom and growth.

Weeks four through six: procure and rack. Purchase or lease the hardware; install the open-weight model stack (Ollama or vLLM as the inference runtime, LangGraph or a similar agentic framework if the workflow is multi-step); stand up the local RAG index over the firm's document corpus with a permissioned access layer.

Weeks seven through ten: build the workflow. Implement the input pipeline, the model calls, the system-of-record write-back, the supervisory loop, and the eval suite with the firm's golden set. The eval suite has to be designed in at construction; bolting it on later structurally fails, as the eval-discipline analysis in this library develops.

Weeks eleven through twelve: pilot and iterate. Deploy to a single team. Run the eval suite weekly. Measure operator trust, exception rate, and write-back integrity. The firm exits the quarter with a working private-AI workflow under its own perimeter, an eval surface that survives examination, and a named maintainer accountable for the next iteration.
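Once the runtime is installed, a first smoke test is a single local inference call that never leaves the machine. The sketch below assumes Ollama is the chosen runtime, that it is listening on its default local port, and that a model has already been pulled; the model tag in the usage comment is an assumption, and the helper names are hypothetical. It uses only the Python standard library.

```python
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt to the local runtime and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage, once a model is pulled (tag is an assumption):
# print(generate("qwen2.5:7b", "Classify this document type: Schedule K-1 (Form 1065)"))
```

Keeping the URL on `localhost` is the point of the exercise: the request, the prompt, and the response all stay inside the firm's perimeter.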

The decision. For confidentiality-bound firms the choice is not whether to adopt AI. The shadow economy already adopted it for them. The choice is whether the firm will own the architecture (the hardware, the models, the eval surface, the audit trail) or continue to operate under the implicit assumption that unsanctioned personal-laptop chatbot use will not eventually surface as the firm's most expensive disclosure event. The economics have moved. The capability bar has moved. The regulatory bar has moved. What remains is the operating decision: stand up the private workflow now, while early builders are still the ones accumulating compounding context, or wait until a client, a regulator, or an audit forces the firm to do it under time pressure. Custody is not a feature you bolt on. It is the architecture you build, and the firms that build it first acquire a structural confidentiality advantage that the market will recognize before the next regulatory cycle.

Working with Sovereign Action. Sovereign Action specializes in private-AI workflow architecture for confidentiality-bound firms: hardware spec'd to the actual workload, open-weight model selection per task, eval suite designed in at construction, full audit trail, named maintainer. If your firm is past the experimentation phase and ready to move client-sensitive work off the public API, the right next step is a [forty-five-minute fit call](/fit-call). The conversation is direct, no fee, no deck, and the outcome is a clear read on which tier fits the actual workload, what the upfront cost looks like at your scale, and how the architecture maps to your specific compliance posture. For firms commissioning a complete first deployment, [the productized first-workflow build](/first-workflow) ships with the full eval suite and a working private-AI architecture at construction, on a fixed-price six-week engagement.

Key takeaways
  • For confidentiality-bound firms (law, accounting, wealth management, healthcare administration, family offices), public-API AI is increasingly a compliance posture that does not survive examination — the choice is private architecture vs. unsanctioned shadow use
  • The economics moved decisively in the past 18 months: what required a $200k server cluster in 2024 runs on a $5k workstation in 2026, and open-weight models sit within 5–15 points of closed-frontier on the tasks that matter for legal, accounting, and financial work
  • Three pragmatic hardware tiers: T1 workstation ($3–8k, hosts 3–9B models, absorbs classification/extraction/routing work), T2 prosumer workstation ($8–15k, hosts 14–70B quantized, addresses drafting + RAG), T3 multi-GPU server or private cloud ($30k+ or $1–5k/mo, hosts 70B+ full precision, addresses complex reasoning)
  • The model × task fitness matrix is the most consequential design choice — wrong choice wastes capital one way and produces unreliable output the other; most workflows mix model sizes (small embedding + small classifier + mid-or-large inference)
  • Three workflow archetypes — one per tier: small-model intake routing for a CPA practice (T1), mid-model contract review for a commercial law firm (T2), large-model agentic tax-memo synthesis for a regional accounting firm (T3)
  • Break-even at ~400M tokens/month: T1 in 3 months, T2 in 6 months, T3 in 31 months — but the financial break-even is only one of three calculations; the compliance posture and the demand-signal capture matter more
  • 90-day deployment pattern: choose the tier (weeks 1–3), procure and rack the hardware + stack (4–6), build the workflow with eval suite designed in at construction (7–10), pilot and iterate (11–12)