Data configuration

The data configuration outlines the structure of the synthetic data by specifying detailed formats for both columns and tables. This is needed to ensure that the generative models adhere to the defined structure at inference time.

Column configuration

An LlmColumnCfg encapsulates all the information needed to generate a column of synthetic data. It ensures that the generated data matches the desired type while maintaining semantic consistency. Advanced users can also enforce specific formats or patterns in the generated values.

This configuration is the foundation for defining an LlmTableCfg used to generate structured synthetic data.

The parameters are:

  • type: The column type, corresponding to a valid Column (e.g., Column.STRING, Column.INTEGER).
  • description: A brief prompt guiding the LLM (e.g., "Full name including middle initials").
  • structure (optional): Constraints defining format, ranges, or patterns.

As an example, for a column containing the names of some individuals:

from aindo.rdml.relational import Column
from aindo.rdml.synth.llm import LlmColumnCfg

column_cfg = LlmColumnCfg(
    type=Column.STRING,
    description="Name of the individual",
    structure=None,
)

Custom column structures (advanced user)

To provide precise control, custom column structures allow users to enforce specific formats or patterns. The available structures include:

Structure Type                 Use Case              Example
CategoricalColumnStructure     Fixed categories      Gender: ["M", "F", "Other"]
IntegerColumnStructure         Bounded integers      Age: No negatives
FloatColumnStructure           Formatted decimals    Weight: XX.X kg
WordSequenceColumnStructure    Text patterns         Product names: 2-4 words
CustomRegexColumnStructure     Custom patterns       ISBNs, phone numbers, emails

These structures ensure the generated data meets exact specifications. For example, an integer column for "Age" can enforce positivity, while a categorical column for "Gender" allows only predefined values.

Consider the following examples to create columns containing specific personal information:

from aindo.rdml.synth.llm import (
    CategoricalColumnStructure,
    CustomRegexColumnStructure,
    FloatColumnStructure,
    IntegerColumnStructure,
    WordSequenceColumnStructure,
)

gender_struct = CategoricalColumnStructure(categories=["M", "F", "Other"])

age_struct = IntegerColumnStructure(
    min_digits=1,
    max_digits=3,
    positive=True,
)

weight_struct = FloatColumnStructure(
    min_int_digits=2,
    max_int_digits=3,
    min_decimal_digits=0,
    max_decimal_digits=1,
    positive=True,
    has_ints=True,
)

name_struct = WordSequenceColumnStructure(
    min_word_len=3,
    max_word_len=15,
    min_n_words=1,
    max_n_words=2,
    allow_digits=False,
)

simple_gmail_struct = CustomRegexColumnStructure(regex=r"[a-zA-Z0-9]{1,15}@gmail\.com")
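
In a regular expression an unescaped `.` matches any single character, so it can help to try a pattern on a few candidate strings before passing it to CustomRegexColumnStructure. The sketch below uses only Python's standard `re` module. It assumes the library's regex dialect treats these constructs the same way Python does, which the text above does not confirm. The candidate strings are made up for illustration.

```python
import re

# Unescaped dot before "com": matches ANY single character in that position.
loose = re.compile(r"[a-zA-Z0-9]{1,15}@gmail.com")

# Escaped dot: matches only a literal "." in that position.
strict = re.compile(r"[a-zA-Z0-9]{1,15}@gmail\.com")

# Hypothetical candidate values, chosen to exercise both patterns.
candidates = ["alice42@gmail.com", "alice42@gmailXcom", "a.b@gmail.com", ""]

for value in candidates:
    # fullmatch requires the whole string to match, which is the relevant
    # check when a pattern is meant to describe an entire cell value.
    loose_ok = loose.fullmatch(value) is not None
    strict_ok = strict.fullmatch(value) is not None
    print(f"{value!r}: loose={loose_ok}, strict={strict_ok}")
```

Only the escaped form rejects a value such as "alice42@gmailXcom". Both forms reject "a.b@gmail.com", because the local part allows letters and digits only.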

Table configuration

To generate a table, define its content using LlmTableCfg. This includes:

  • name: Table identifier (e.g., "Customers").
  • description: Context for the LLM (e.g., "E-commerce user profiles with purchase history").
  • columns: A dictionary of LlmColumnCfg objects, one for each table column.

The following example defines a table of individuals with three columns:

from aindo.rdml.relational import Column
from aindo.rdml.synth.llm import CustomRegexColumnStructure, LlmColumnCfg, LlmTableCfg

table_cfg = LlmTableCfg(
    name="Individuals",
    description="A table containing personal information about individuals.",
    columns={
        "name": LlmColumnCfg(
            type=Column.STRING,
            description="Name of the individual",
            structure=None
        ),
        "age": LlmColumnCfg(
            type=Column.INTEGER,
            description="Age of the individual",
            structure=None
        ),
        "email": LlmColumnCfg(
            type=Column.STRING,
            description="Email ending with @gmail.com",
            structure=CustomRegexColumnStructure(regex=r"[a-zA-Z0-9]{1,15}@gmail\.com"),
        ),
    },
)
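
Generated rows can be sanity-checked against the intent of a table configuration like the one above. The sketch below is one way to do that in plain Python, independent of the library. The rows, the `check_row` helper, and the assumption that values arrive as strings are all hypothetical, since this section does not describe the output format of the generator.

```python
import re

# Hypothetical rows standing in for generated output (values as strings).
rows = [
    {"name": "Ada Lovelace", "age": "36", "email": "ada1815@gmail.com"},
    {"name": "Bob", "age": "-4", "email": "bob@example.org"},
]

# Column names taken from the "Individuals" table configuration.
EXPECTED_COLUMNS = {"name", "age", "email"}
AGE_RE = re.compile(r"[0-9]+")
EMAIL_RE = re.compile(r"[a-zA-Z0-9]{1,15}@gmail\.com")


def check_row(row: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems found in one row."""
    problems = []
    if set(row) != EXPECTED_COLUMNS:
        # Report columns that are missing or unexpected, then stop:
        # the remaining checks assume all three columns are present.
        mismatch = sorted(set(row) ^ EXPECTED_COLUMNS)
        problems.append(f"column mismatch: {mismatch}")
        return problems
    if not row["name"].strip():
        problems.append("empty name")
    if AGE_RE.fullmatch(row["age"]) is None:
        problems.append(f"age is not a non-negative integer: {row['age']!r}")
    if EMAIL_RE.fullmatch(row["email"]) is None:
        problems.append(f"email does not match pattern: {row['email']!r}")
    return problems


# Map each row index to the problems found in that row.
report = {i: check_row(row) for i, row in enumerate(rows)}
for index, problems in report.items():
    print(index, problems or "ok")
```

A check like this is most useful for columns configured with structure=None, where nothing constrains the format at generation time.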