Data configuration
The data configuration outlines the structure of the synthetic data by specifying detailed formats for both columns and tables. This is needed to ensure that the generative models adhere to the defined structure at inference time.
Column configuration
An LlmColumnCfg encapsulates all the information needed to generate a column of synthetic data. It ensures that the generated data matches the desired type while maintaining semantic consistency. Advanced users can also enforce specific formats or patterns in the generated values. This configuration is the building block of an LlmTableCfg, which is used to generate structured synthetic data.
The parameters are:
type: The column type, corresponding to a valid Column (e.g., Column.STRING, Column.INTEGER).
description: A brief prompt guiding the LLM (e.g., "Full name including middle initials").
structure (optional): Constraints defining format, ranges, or patterns.
As an example, for a column containing the names of some individuals:
from aindo.rdml.relational import Column
from aindo.rdml.synth.llm import LlmColumnCfg

column_cfg = LlmColumnCfg(
    type=Column.STRING,
    description="Name of the individual",
    structure=None,
)
Custom column structures (advanced user)
To provide precise control, custom column structures allow users to enforce specific formats or patterns. The available structures include:
Structure Type | Use Case | Example |
---|---|---|
CategoricalColumnStructure | Fixed categories | Gender: ["M", "F", "Other"] |
IntegerColumnStructure | Bounded integers | Age: No negatives |
FloatColumnStructure | Formatted decimals | Weight: XX.X kg |
WordSequenceColumnStructure | Text patterns | Product names: 2-4 words |
CustomRegexColumnStructure | Custom patterns | ISBNs, phone numbers, emails |
These structures ensure the generated data meets exact specifications. For example, an integer column for "Age" can enforce positivity, while a categorical column for "Gender" allows only predefined values.
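To make the table more concrete, each structure can be thought of as restricting generated values to a pattern. The sketch below uses only Python's standard re module; the regular expressions and the conforms helper are hand-written illustrations of the table's examples, not the library's internal representation, and the exact semantics of the real structure classes may differ.

```python
import re

# Illustrative patterns only: hand-written approximations of the examples
# in the table above, not how the library implements its structures.
PATTERNS = {
    "gender": r"M|F|Other",                         # fixed categories
    "age": r"\d{1,3}",                              # no sign, 1 to 3 digits
    "weight": r"\d{2}\.\d",                         # XX.X
    "product_name": r"[A-Za-z]+( [A-Za-z]+){1,3}",  # 2 to 4 words
}


def conforms(kind: str, value: str) -> bool:
    """Return True if the whole string matches the pattern for `kind`."""
    return re.fullmatch(PATTERNS[kind], value) is not None


if __name__ == "__main__":
    for kind, value in [("gender", "Other"), ("age", "-5"), ("weight", "72.5")]:
        print(kind, repr(value), conforms(kind, value))
```

A check like this can also be run on generated values after the fact to confirm that they respect the intended format.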
Consider the following examples to create columns containing specific personal information:
from aindo.rdml.synth.llm import (
    CategoricalColumnStructure,
    CustomRegexColumnStructure,
    FloatColumnStructure,
    IntegerColumnStructure,
    WordSequenceColumnStructure,
)

# Gender: one of a fixed set of categories.
gender_struct = CategoricalColumnStructure(categories=["M", "F", "Other"])

# Age: positive integer with 1 to 3 digits.
age_struct = IntegerColumnStructure(
    min_digits=1,
    max_digits=3,
    positive=True,
)

# Weight: positive number with 2 to 3 integer digits and at most 1 decimal digit.
weight_struct = FloatColumnStructure(
    min_int_digits=2,
    max_int_digits=3,
    min_decimal_digits=0,
    max_decimal_digits=1,
    positive=True,
    has_ints=True,
)

# Name: 1 to 2 words of 3 to 15 characters each, without digits.
name_struct = WordSequenceColumnStructure(
    min_word_len=3,
    max_word_len=15,
    min_n_words=1,
    max_n_words=2,
    allow_digits=False,
)

# Email: the dot is escaped so that it matches a literal "." only.
simple_gmail_struct = CustomRegexColumnStructure(regex=r"[a-zA-Z0-9]{1,15}@gmail\.com")
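Because a regex structure constrains every generated value, it can help to try a pattern with Python's built-in re module before using it. The sketch below is independent of the library and assumes the pattern is applied to the whole value (full-match semantics), which is not stated above. It also illustrates why the dot in a domain is worth escaping: an unescaped "." matches any character.

```python
import re

# Variant with the dot left unescaped: "." matches any single character.
loose = re.compile(r"[a-zA-Z0-9]{1,15}@gmail.com")
# Variant with the dot escaped: only a literal "." is accepted.
strict = re.compile(r"[a-zA-Z0-9]{1,15}@gmail\.com")

# Hand-picked candidate strings, for illustration only.
candidates = [
    "alice42@gmail.com",                # well-formed
    "bob@gmailxcom",                    # "x" where the dot should be
    "averyveryverylongname@gmail.com",  # local part longer than 15 characters
    "a.b@gmail.com",                    # "." is not allowed in the local part
]


def accepted(pattern: re.Pattern, value: str) -> bool:
    """Return True if the entire string matches the pattern."""
    return pattern.fullmatch(value) is not None


if __name__ == "__main__":
    for value in candidates:
        print(f"{value!r}: loose={accepted(loose, value)}, strict={accepted(strict, value)}")
```

Only the second candidate is treated differently by the two variants: the loose pattern accepts it, the strict one rejects it.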
Table configuration
To generate a table, define its content using LlmTableCfg. This includes:
name: Table identifier (e.g., "Customers").
description: Context for the LLM (e.g., "E-commerce user profiles with purchase history").
columns: A dictionary mapping each column name to its LlmColumnCfg object, one for each table column.
from aindo.rdml.relational import Column
from aindo.rdml.synth.llm import CustomRegexColumnStructure, LlmColumnCfg, LlmTableCfg

table_cfg = LlmTableCfg(
    name="Individuals",
    description="A table containing personal information about individuals.",
    columns={
        "name": LlmColumnCfg(
            type=Column.STRING,
            description="Name of the individual",
            structure=None,
        ),
        "age": LlmColumnCfg(
            type=Column.INTEGER,
            description="Age of the individual",
            structure=None,
        ),
        "email": LlmColumnCfg(
            type=Column.STRING,
            description="Email ending with @gmail.com",
            # The dot is escaped so that it matches a literal "." only.
            structure=CustomRegexColumnStructure(regex=r"[a-zA-Z0-9]{1,15}@gmail\.com"),
        ),
    },
)
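Once data has been generated from a table configuration like this one, a quick sanity check can confirm that each row has the expected columns, types, and email format. The following is a minimal sketch using only the standard library. It assumes the generated table can be converted to a list of dictionaries, one per row, which is not described above; EXPECTED, EMAIL_RE, and check_row are hypothetical names, and the sample rows in the tests are made up for illustration.

```python
import re

# Expected Python type for each column of the "Individuals" table
# (hypothetical mapping mirroring the configuration above).
EXPECTED = {"name": str, "age": int, "email": str}
EMAIL_RE = re.compile(r"[a-zA-Z0-9]{1,15}@gmail\.com")


def check_row(row: dict) -> list[str]:
    """Return a list of human-readable problems found in one row."""
    problems = []
    # Report columns that are missing or unexpected.
    for col in sorted(set(EXPECTED) - set(row)):
        problems.append(f"{col}: missing column")
    for col in sorted(set(row) - set(EXPECTED)):
        problems.append(f"{col}: unexpected column")
    # Check the type of every expected column that is present.
    for col, typ in EXPECTED.items():
        if col in row and not isinstance(row[col], typ):
            problems.append(f"{col}: expected {typ.__name__}")
    # Check the email format only when the value is a string.
    email = row.get("email")
    if isinstance(email, str) and EMAIL_RE.fullmatch(email) is None:
        problems.append("email: does not match pattern")
    return problems


if __name__ == "__main__":
    sample = {"name": "Bob", "age": "forty", "email": "bob@example.org"}
    print(check_row(sample))
```

Returning a list of problems rather than raising on the first one makes it easier to summarize issues across many rows.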