Every real-world application collects valuable unstructured data. It’s what gives our applications life and makes them relatable. Think of notes, descriptions, emails, chat logs, social media posts, and reviews. These are the elements that fill out our user interfaces and make our applications more than just a collection of numbers and fields.
However, this unstructured data is almost always stored within a highly structured context. A user's message is stored in a database alongside a foreign key to their user record, a timestamp, and other structured metadata. For years, developers used Lorem Ipsum to create mock unstructured data. It's the classic developer’s crutch, a placeholder that fills space but offers nothing in the way of realism.
Lorem Ipsum served us well, but its lack of realism is a distraction. In a demo, it makes your product look unfinished. In testing, it provides a false sense of security, as the data doesn't reflect the chaos and variety of real-world input. Its time has come and gone. We now have a much more powerful tool at our disposal: Large Language Models (LLMs).
The impact of LLMs on unstructured data generation
LLMs are excellent at generating text that simulates real human communication. This has been a core capability since their inception, even before recent innovations in agents, tool use, and complex reasoning. They’re designed to produce coherent, natural-sounding text, and they do so much faster and at a much lower cost than any human could.
At a small scale, LLMs can also generate structured data. If you need a handful of JSON documents for a quick test, you can ask a model to produce them, and it will often get the job done beautifully. But problems arise when you try to use them to generate an entire database.
The limitations of LLMs in structured data generation
When you need to generate data at a large scale, using an LLM on its own falls apart quickly. Real-world databases are complex, with hundreds of tables and specific rules and relationships.
- Complex constraints: Databases are full of constraints. Numbers, dates, and enums often follow particular distributions; relationships between tables (like the number of users in a group) have specific cardinalities. LLMs, on their own, don't understand these intricate rules.
- Uniqueness: LLMs have no inherent understanding of uniqueness. Generating a long list of unique email addresses or usernames is virtually impossible at scale, as a model will inevitably repeat itself.
- Mathematical precision: LLMs are notoriously bad at arithmetic. If you need to generate an invoice where the sum of line items plus tax and shipping equals the total, an LLM will struggle to get the numbers right. It might get the language perfect, but the math will be a mess, a well-documented limitation that researchers are still working to address.
- Speed: Even if you could ignore these limitations, LLMs aren't fast enough to generate a full, realistic database in a reasonable amount of time. Trying to generate a million rows of data using a traditional LLM approach would take an impractical amount of time and money.
For all their strengths, LLMs are not designed to be database engines. They are text generators.
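By contrast, conventional code has none of these problems. As an illustration, here is a minimal sketch in plain Python (not Fabricate's implementation) of how a rule-based generator makes the invoice arithmetic from the example above hold by construction:

```python
import random
from decimal import Decimal

def fake_invoice(num_items: int = 3) -> dict:
    """Generate a synthetic invoice whose totals balance exactly, by construction."""
    def cents() -> Decimal:
        # A random dollar amount between $1.00 and $500.00, exact to the cent.
        return Decimal(random.randint(100, 50_000)) / 100

    items = [{"description": f"Line item {i + 1}", "amount": cents()}
             for i in range(num_items)]
    subtotal = sum(item["amount"] for item in items)
    tax = (subtotal * Decimal("0.08")).quantize(Decimal("0.01"))
    shipping = Decimal("5.99")
    return {
        "line_items": items,
        "subtotal": subtotal,
        "tax": tax,
        "shipping": shipping,
        # The total is computed, never "guessed," so it always adds up.
        "total": subtotal + tax + shipping,
    }
```

Using `Decimal` rather than floats keeps every amount exact to the cent, and because the total is derived from the parts, it can never be inconsistent, no matter how many million rows you generate.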
A new approach: Equipping LLMs with structured data
At Tonic.ai, our product Tonic Fabricate offers a solution for generating both structured and unstructured data at scale. Our approach is inspired by the classic concept of a mail merge: we parameterize an LLM prompt with context from your structured data. This allows us to use the LLM for what it’s best at (generating realistic text) while using structured data generation techniques for what they’re best at (creating consistent, rule-abiding data).
Imagine you’re synthesizing data for a medical records system. This system has a ton of structured data: patient names, ages, locations, diagnoses, and treatments. It also has unstructured data, like doctors' notes, sprinkled throughout. With Fabricate, we can first generate all the structured data, ensuring everything from patient names to medical history is consistent and follows the rules of the database.
Then, we use this structured data as context for an LLM to generate realistic doctors' notes. Our Unstructured Data generator allows you to provide this context as variables in a prompt, like so:
"Given the patient's name {patientName}, age {patientAge}, sex {patientSex}, location {patientLocation}, chief complaint {chiefComplaint}, and medical history {medicalHistory}, along with the treating doctor's name {doctorName} and facility {facilityName}, write a realistic doctor's note for this patient's visit on {date}."
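In code, this "mail merge" amounts to simple template substitution: each row of previously generated structured data fills the placeholders to produce one unique prompt. A minimal sketch in plain Python (the row values below are invented for illustration):

```python
# Mail-merge sketch: one structured row in, one unique LLM prompt out.
PROMPT_TEMPLATE = (
    "Given the patient's name {patientName}, age {patientAge}, sex {patientSex}, "
    "location {patientLocation}, chief complaint {chiefComplaint}, and medical "
    "history {medicalHistory}, along with the treating doctor's name {doctorName} "
    "and facility {facilityName}, write a realistic doctor's note for this "
    "patient's visit on {date}."
)

def build_prompt(row: dict) -> str:
    """Fill the template's placeholders with one row of structured data."""
    return PROMPT_TEMPLATE.format(**row)

# Example row, as it might come out of the structured-data generation phase.
row = {
    "patientName": "Maria Alvarez",
    "patientAge": 58,
    "patientSex": "F",
    "patientLocation": "Tucson, AZ",
    "chiefComplaint": "shortness of breath",
    "medicalHistory": "type 2 diabetes, hypertension",
    "doctorName": "Dr. Okafor",
    "facilityName": "Desert Vista Clinic",
    "date": "2024-03-14",
}
prompt = build_prompt(row)
```

Every distinct row yields a distinct prompt, which is what drives variety in the generated notes.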
By providing the LLM with this rich context, you empower it to generate a note that is not only realistic but also highly specific and relevant to the structured data you've already created.
Why this approach works so well
This method unlocks new levels of efficiency and realism in data generation.
- Cost-effectiveness: Even the most affordable LLMs, like OpenAI's mini models or Anthropic's Claude Haiku, are fantastic at generating text when given this much context. You’re only paying for the tokens needed to generate the text, not for the model to "figure out" the structured data itself. This keeps your token costs low.
- Variety and uniqueness: By generating the structured data first, you create a massive number of possible combinations. Each combination of patient data, doctor, and diagnosis becomes a unique prompt, ensuring that the resulting unstructured data is varied and never repetitive.
- Parallelization: Prompting LLMs with structured context is highly parallelizable. You can run thousands of these prompts at the same time, overcoming the latency of individual LLM calls. If you simply asked 10 LLMs in parallel to each generate a "random doctor's note," you'd likely get 10 very similar results. By controlling the context, we ensure that each parallel call generates a unique and relevant note, dramatically speeding up the generation process.
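That fan-out can be sketched in a few lines of Python. Here, `call_llm` is a hypothetical stand-in for your provider's API client; a thread pool works well because LLM calls are I/O-bound:

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call (e.g., an HTTP request)."""
    return f"[generated note for: {prompt[:40]}...]"

def generate_notes(prompts: list[str], max_workers: int = 32) -> list[str]:
    """Issue one LLM call per unique, context-rich prompt, many at a time.

    Results come back in the same order as the input prompts, so each
    generated note can be written back to the row that produced it.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_llm, prompts))
```

Because each prompt carries its own structured context, the parallel calls cannot collapse into near-duplicate outputs the way unconstrained "write a random note" prompts would.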
Tonic Fabricate uses the LLM only where it's absolutely needed: generating realistic, human-like text. Everything else is handled by a much more efficient and reliable system.
The takeaway
Over 10 years ago, I launched Mockaroo, which quickly became the highest-ranked synthetic data generator on Google. A major factor in its success was that it gave users the ability to tell a story with their data. Using formulas and custom distributions, sales engineers could create demos that supported a narrative: "Gas prices go up in the winter"; "Risk of heart disease is higher in this demographic"; etc. Mockaroo allowed you to massage fake data into the right shape.
The approach we take with Fabricate builds on that and adds color. Now, unstructured data supports your structured data: "Not only do gas prices go up in the winter, but users can complain about it too!" This added dimension of realism allows customers to build better, test better, and sell better. And the fact that it's all synthetic means that greenfield projects can have test data as rich as those that have been in production for years.
This methodology isn’t limited to medical records. It can be applied to generating data for any system that combines structured and unstructured data, including:
- Relational and document-based databases;
- A corpus of documents (PDFs, Word docs, etc.) with consistent metadata; and,
- Simulated user interactions, like chat logs, social media posts, and reviews.
The combination of structured data generation with LLM-powered unstructured data generation gives you the best of both worlds: a database that is both statistically accurate and full of realistic, human-like content. You get all the benefits of real-world data without the privacy and security risks. To see it in action, check out this quick video demonstration of the feature in Tonic Fabricate.
If you're ready to move beyond Lorem Ipsum and start creating truly realistic synthetic data, sign up for a free account of Tonic Fabricate today.