Synthetic data generation has evolved from rule-based systems to AI-powered agents that can create realistic test data with minimal configuration. This transformation addresses a critical challenge for platform teams: providing developers with quick, compliant, and self-service access to data when production systems remain off-limits due to security and compliance requirements.
TL;DR: Main insights
- Agentic AI generates synthetic data by writing and executing code autonomously, making it faster and more scalable than traditional LLM token-based generation
- Platform teams can use synthetic data to unblock innovation when production data access is restricted by compliance or security policies
- Tonic Fabricate's workflows feature automates repeatable tasks like pushing data to APIs, integrating synthetic data generation directly into CI/CD pipelines
- The tool supports multiple output formats including databases, JSON, PDFs, and documents, making it versatile for testing, demos, and AI model development
Mark Brocato is a web framework enthusiast, JavaScript advocate, Rubyist, and the creator of React Storefront and Mockaroo. As head of engineering for Tonic's Fabricate product, he leads a small team focused on generating synthetic data from scratch.
You can watch the full discussion here if you missed it, or check out the recap below:
The data access problem: Why synthetic data matters
Even large companies with vast data repositories face a critical bottleneck: developers, QA testers, and sales teams often cannot access production data due to compliance restrictions. Mark explains, "Do the developers have access to it? Do the salespeople have access to it? Do the QA testers have access to it? In many cases, no, especially for businesses that are in more sensitive domains like financials, healthcare, anybody who's regulated."
This creates a paradox where organizations have the data they need but cannot use it for development, testing, or troubleshooting. Synthetic data solves this by creating realistic alternatives that preserve the statistical properties and relationships of production data without exposing sensitive information.
Tonic addresses this challenge with three complementary products:
- Tonic Structural - Deidentifies production data by identifying sensitive information and replacing it with synthetic substitutes, allowing teams to use production-like data in lower environments
- Tonic Fabricate - Generates data purely from scratch based on schemas and requirements, ideal for new features, products, or when production data access is too risky
- Tonic Textual - Redacts and synthesizes sensitive data in text documents for training large language models, particularly useful for fraud detection and document processing
Three approaches to synthetic data generation
Mark outlines three fundamental approaches to creating synthetic data, each with distinct advantages:
Production data deidentification uses existing production data and replaces sensitive values with synthetic substitutes. This approach automatically preserves nuances and relationships that you might not even be aware of in your production data. "This field tends to correlate with this other field. There's nothing in your schema that says that that should happen. It's just that reality says that that should happen," Mark notes. If you have production data and can use it, this is generally the most efficient approach.
Rules-based data synthesis involves defining declarative rules for generating every column, relationship, and cardinality. Tools like Mockaroo and the initial release of Tonic Fabricate use this method. It allows high throughput and produces repeatable outcomes, but comes at the expense of expressiveness. "There's a lot to learn, a lot to set up," Mark explains. "Anything declarative, it's hard to program in a lot of nuance because everything is sort of a one-shot: 'here are my rules, there's the data' sort of a thing."
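To see why declarative rules trade nuance for repeatability, here is a minimal sketch of a rules-based generator in JavaScript. The rule format and helper names are invented for illustration; they are not Mockaroo's or Fabricate's actual configuration syntax.

```javascript
// Minimal sketch of rules-based generation. The rule format below is
// an invented example, not Mockaroo's or Fabricate's actual syntax.
const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];

const rules = {
  customers: {
    count: 100,
    columns: {
      id: (i) => i + 1, // sequential primary key
      name: () => pick(['Ava Chen', 'Sam Ortiz', 'Lee Park']),
      email: (i) => `user${i}@example.com`,
    },
  },
};

// One-shot interpreter: apply the rules, get the data. There is no
// feedback loop, which is why declarative approaches are fast and
// repeatable but hard to imbue with real-world nuance.
function generate(rules) {
  const out = {};
  for (const [table, spec] of Object.entries(rules)) {
    out[table] = Array.from({ length: spec.count }, (_, i) =>
      Object.fromEntries(
        Object.entries(spec.columns).map(([col, fn]) => [col, fn(i)])
      )
    );
  }
  return out;
}

console.log(generate(rules).customers[0]);
// e.g. { id: 1, name: 'Sam Ortiz', email: 'user0@example.com' }
```

Everything the generator can express has to be written into the rules up front, which is exactly the "one-shot" limitation Mark describes.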
Agentic data generation represents the newest approach, where you provide your schema and requirements to a well-equipped AI agent and let it figure out how to generate the most realistic data. The agent uses ad hoc methods, writes code, and generates data on the fly using its unique understanding of your use case. Mark emphasizes, "It has like a really intuitive UX because there's just almost nothing to learn. You're just chatting with an agent."
How agentic AI works: Beyond simple LLM prompting
The key differentiator of agentic AI is not just the large language model itself, but the tools and environment provided to it. Mark clarifies, "If you went into Gemini today and you asked it, give me fake data for this scenario, it's very hard for it to scale to like more than one table."
An agent combines a "brain" (the LLM) with a "body" (tools that allow autonomous interaction with the outside world). Tonic Fabricate equips Claude Sonnet and Haiku with:
- A SQLite database for iterative data building with relational integrity
- The ability to write and execute JavaScript code in a secure sandbox
- Built-in data generators for algorithmic tasks like credit card numbers, UUIDs, and object IDs
- Tools for querying, file manipulation, document generation, and API connections
This architecture allows the agent to generate data much more efficiently than token-based generation. Instead of outputting every row using LLM tokens (which would be slow and expensive), the agent writes code that generates data programmatically. "It's not using LLM AI tokens to generate each data point because that would be very slow and very expensive. It's coming up with these little mini programs on the fly," Mark explains.
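The talk doesn't show the generated programs themselves, so the following is a guess at their shape: a small, throwaway JavaScript program the agent writes and runs in its sandbox, producing rows algorithmically rather than token by token. The helper functions are assumptions, not Fabricate's actual built-in generators.

```javascript
// Sketch of an agent-written "mini program": rows come from code
// execution, not from LLM tokens. Helpers below are illustrative.
const { randomUUID } = require('crypto');

// Luhn check digit, so generated card numbers are structurally valid.
function luhnCheckDigit(partial) {
  const digits = partial.split('').map(Number).reverse();
  const sum = digits.reduce((acc, d, i) => {
    if (i % 2 === 0) {      // every second digit from the right is
      d *= 2;               // doubled once the check digit is appended
      if (d > 9) d -= 9;
    }
    return acc + d;
  }, 0);
  return (10 - (sum % 10)) % 10;
}

function fakeCardNumber() {
  let partial = '4'; // Visa-style prefix
  while (partial.length < 15) partial += Math.floor(Math.random() * 10);
  return partial + luhnCheckDigit(partial);
}

// Generating 10,000 payment rows costs zero LLM tokens.
const payments = Array.from({ length: 10_000 }, () => ({
  id: randomUUID(),
  card_number: fakeCardNumber(),
  amount_cents: Math.floor(Math.random() * 20_000) + 100,
}));
console.log(payments.length, payments[0]);
```

The LLM spends its tokens reasoning about the schema and writing code like this; the data itself is then produced at the speed of ordinary program execution.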
Live demo: From schema to synthetic data in minutes
Mark demonstrates the workflow by uploading a typical online store schema with customers, products, orders, line items, payment methods, and payments. The process is remarkably simple:
- Drop the SQL schema into Fabricate
- Answer a few questions about record counts, customer demographics, product types, and order timeframes
- Watch as the agent analyzes the schema, writes code, and generates data progressively
The agent works through each table iteratively, querying existing data to build out foreign keys and maintain referential integrity. "It's actually querying the existing data to build out the keys," Mark notes. "It can look at and sample the existing data. It's a very iterative process."
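Fabricate's internals aren't public, but the iterative pattern Mark describes can be sketched with an in-memory SQLite database (using the better-sqlite3 Node library here as a stand-in): generate one table, then query its keys to generate the next.

```javascript
// Sketch of the iterative pattern: query already-generated rows to
// build foreign keys for the next table. Illustrative only.
const Database = require('better-sqlite3');
const db = new Database(':memory:');

db.exec(`
  CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
  CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total_cents INTEGER
  );
`);

// Pass 1: generate customers.
const insertCustomer = db.prepare('INSERT INTO customers (name) VALUES (?)');
for (let i = 0; i < 500; i++) insertCustomer.run(`Customer ${i}`);

// Pass 2: sample the customer keys that now exist, then generate
// orders that reference them -- referential integrity by construction.
const customerIds = db.prepare('SELECT id FROM customers').all().map((r) => r.id);
const insertOrder = db.prepare(
  'INSERT INTO orders (customer_id, total_cents) VALUES (?, ?)'
);
for (let i = 0; i < 2000; i++) {
  const fk = customerIds[Math.floor(Math.random() * customerIds.length)];
  insertOrder.run(fk, Math.floor(Math.random() * 50_000) + 500);
}

console.log(db.prepare('SELECT COUNT(*) AS n FROM orders').get().n); // 2000
```

Because each pass reads back what the previous pass wrote, every foreign key points at a row that actually exists.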
Within minutes, the system generates thousands of realistic records with proper relationships, realistic product names and descriptions, and appropriate pricing. The generated data can be exported in multiple formats or pushed directly to supported databases.
Workflows: Automating API data generation
A new feature called workflows extends Fabricate's capabilities beyond database generation to API-driven scenarios. Mark demonstrates generating data for a Stripe sandbox environment by:
- Uploading a subset of Stripe's API contract
- Describing the desired outcome (customers, products, prices, subscriptions)
- Letting the agent generate both the data and a workflow to push it to Stripe
The workflow becomes a reusable, callable component that can integrate into CI/CD pipelines or test automation. "These workflows are also callable via APIs as well. So they could become reusable parts of your sort of test data management," Mark explains.
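The generated workflow itself isn't shown in the recap, but its effect can be sketched with Stripe's official Node library: take synthetic records and create the corresponding customer, product, price, and subscription objects in a test-mode account. The record shape and field names below are assumptions; in Fabricate the agent generates this push logic for you.

```javascript
// Hand-rolled sketch of what the demoed workflow effectively does:
// push generated records into a Stripe test-mode (sandbox) account.
const Stripe = require('stripe');
const stripe = new Stripe(process.env.STRIPE_TEST_KEY); // sk_test_... only

async function pushToStripe(records) {
  for (const r of records) {
    const customer = await stripe.customers.create({
      name: r.name,
      email: r.email,
    });
    const product = await stripe.products.create({ name: r.productName });
    const price = await stripe.prices.create({
      product: product.id,
      unit_amount: r.unitAmountCents,
      currency: 'usd',
      recurring: { interval: 'month' },
    });
    await stripe.subscriptions.create({
      customer: customer.id,
      items: [{ price: price.id }],
    });
  }
}

// `records` would come from the synthetic data set generated earlier.
pushToStripe([
  { name: 'Ava Chen', email: 'ava@example.com',
    productName: 'Pro Plan', unitAmountCents: 4900 },
]).catch(console.error);
```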
This capability is particularly valuable for platform teams managing multiple environments or microservices. You can model different services within a single Fabricate database, maintain relational integrity across dimensions, and push data to various systems through automated workflows.
Practical applications for platform engineering
Synthetic data generation with agentic AI addresses several common platform engineering challenges:
Developer self-service - Consumers of synthetic data can skip the middleman and use the tool directly. "The consumers can sort of skip the middleman. They just use the tool directly. They drop into it and in 20 minutes they get the exact data set that they need," Mark says.
CI/CD integration - Workflows can be called via API, making them reusable components in automated pipelines for standing up test environments or generating data for integration tests (see the sketch at the end of this list).
Compliance and security - Teams can generate data without ever accessing production systems, reducing risk and bureaucratic overhead. "You need to be able to generate data from scratch, and you can't access production data at all. It's just too risky," Mark notes.
API mocking and testing - The tool can generate data for API contracts, supporting both backend database approaches and frontend API-driven approaches to environment setup.
AI model development - Fabricate can generate both structured and unstructured data, including realistic documents like PDFs, Word files, and emails based on structured context.
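To make the CI/CD integration above concrete, here is a hypothetical pipeline step that triggers a saved workflow over HTTP. The endpoint URL, path, and payload are invented for illustration; consult Fabricate's documentation for the real API.

```javascript
// Hypothetical CI step: trigger a saved workflow before integration
// tests run. The endpoint shape and auth header are assumptions.
const WORKFLOW_URL =
  'https://fabricate.example.com/api/v1/workflows/seed-stripe-sandbox/run';

async function seedTestEnvironment() {
  const res = await fetch(WORKFLOW_URL, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.FABRICATE_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ environment: 'staging' }),
  });
  if (!res.ok) throw new Error(`Workflow failed: ${res.status}`);
  console.log('Test data seeded:', await res.json());
}

seedTestEnvironment().catch((err) => {
  console.error(err);
  process.exit(1); // fail the pipeline if seeding fails
});
```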
Cost and scalability considerations
Token consumption is a valid concern with AI-based tools. Tonic has implemented several optimizations to keep costs down:
- Large schemas (even with 500+ tables) are translated out-of-band with cheaper models, never entering the main chat context
- The agent learns about tables by querying its own database rather than keeping everything in context
- Data generation happens through code execution rather than token-by-token LLM output
Mark notes, "We can use quite a large context. But Fabricate itself is very much concerned with this problem and is iterating all on it all the time to make the use of those LLM tokens" more efficient.
For scale, generating millions of records per table is feasible. "The data generation is almost instant because the code runs pretty fast. So you have to generate quite a lot of data before the actual data generation winds up taking more time than the overhead of writing the code," Mark explains.
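As a rough sanity check of that claim, a plain JavaScript loop can produce a million simple rows in around a second on a typical laptop (numbers vary by machine; this is not a Fabricate benchmark). The slow part of an agentic run is the LLM writing the program, not the program running.

```javascript
// Timing a million-row generation loop in plain JavaScript.
console.time('generate 1M rows');
const rows = new Array(1_000_000);
for (let i = 0; i < rows.length; i++) {
  rows[i] = {
    id: i + 1,
    sku: `SKU-${String(i).padStart(7, '0')}`,
    price_cents: Math.floor(Math.random() * 10_000) + 99,
  };
}
console.timeEnd('generate 1M rows');
console.log(rows[0], rows[rows.length - 1]);
```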
If you enjoyed this, you can find more great insights and events from our Platform Engineering Community here.
Key takeaways
- Agentic AI reduces the barrier to entry for synthetic data generation - Platform teams can provide self-service access to realistic test data without requiring deep domain expertise or extensive configuration, enabling developers to generate what they need in minutes rather than days.
- The combination of LLM reasoning and programmatic execution creates scalable solutions - By writing code that generates data rather than using tokens for each record, agentic approaches can produce millions of records efficiently while maintaining realistic relationships and nuances.
- Workflows transform synthetic data generation from a one-time task to a repeatable platform capability - API-callable workflows integrate synthetic data generation directly into CI/CD pipelines, test automation, and environment provisioning, making it a core component of platform engineering infrastructure.
- Multiple approaches to synthetic data serve different use cases - Production data deidentification works best when you have accessible production data, rules-based synthesis suits repeatable scenarios with clear requirements, and agentic generation shines when you need realistic data quickly from nothing more than a schema and a plain-language description of your requirements.
