Generation

LLMs can generate data in a variety of contexts. The LlmTabularModel provides two methods to generate synthetic data, one for each of two common scenarios:

Given a table schema, the LlmTabularModel.generate() method generates novel synthetic data following the provided schema.
Given an existing dataset, the LlmTabularModel.add_columns() method generates new synthetic columns that can be integrated into the dataset.

Data generation

To generate synthetic data that adheres to a specific table schema, use the LlmTabularModel.generate() method. The required parameters are:

cfg: The configuration of the data structure to generate.
n_samples: The number of records to generate.

There are also some optional parameters:

batch_size: The number of samples generated per batch. Adjust it based on available memory and performance requirements: reduce it under memory constraints, or increase it to speed up generation.
max_tokens: The maximum number of tokens to generate for each sample.
generation_mode: The generation mode of the LLM. It can be a GenerationMode or its string representation (structured, rejection).
retry_on_fail: The number of retry attempts made when a sample is rejected because it does not meet the specified constraints.
temp: The temperature parameter for sampling.
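
To make the relationship between n_samples and batch_size concrete, the following minimal sketch shows how a request could be split into batches. It is an illustration only, not the library's internal logic; plan_batches is a hypothetical helper.

```python
def plan_batches(n_samples: int, batch_size: int) -> list[int]:
    """Split n_samples into consecutive batches of at most batch_size records."""
    if n_samples < 0 or batch_size < 1:
        raise ValueError("n_samples must be >= 0 and batch_size must be >= 1")
    full, rest = divmod(n_samples, batch_size)
    # `full` complete batches, plus one smaller batch for the remainder (if any).
    return [batch_size] * full + ([rest] if rest else [])


print(plan_batches(n_samples=10, batch_size=4))  # [4, 4, 2]
```

A larger batch_size means fewer batches, each holding more samples in memory at once, which is the trade-off described for this parameter.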

Performance considerations

Generating structured data with an LLM is computationally expensive. The performance depends on several factors, including:

Number of columns: More columns increase complexity and memory usage.
Complexity of constraints: Strict constraints (e.g., regex-based structures) may increase rejection rates.
Data length: Longer text-based columns require more tokens and increase processing time.
Batch size: Larger batch sizes speed up generation but demand more memory.

Since LLM inference is resource-intensive, using a GPU is highly recommended to speed up the process. To specify the computation device, set the model’s device attribute:

from aindo.rdml.synth.llm import LlmTabularModel

model = LlmTabularModel.load(ckpt_path=...)
model.device = "cuda"
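
On machines where a GPU may not be present, the device can be chosen at runtime. The sketch below assumes that PyTorch is installed alongside the library and that "cpu" is an accepted value for the device attribute; neither is stated in this guide, and pick_device is a hypothetical helper.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: PyTorch is available in the environment
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
# model.device = device  # `model` loaded as in the snippet above
```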

Generation modes

The module supports two approaches to data generation:

GenerationMode.STRUCTURED (guided generation): The LLM is guided to produce structured data that adheres to the predefined structure. This approach is efficient but may cause inconsistencies if the model struggles to follow the constraints (e.g., generating numbers instead of names).
GenerationMode.REJECTION (filter-based generation): The model generates data freely, and samples that do not meet the constraints are discarded and regenerated. This method is more reliable but less efficient since rejected samples require additional processing. The retry_on_fail parameter controls how many times the model attempts to regenerate valid data.

from pathlib import Path

from aindo.rdml.relational import Column
from aindo.rdml.synth.llm import GenerationMode, LlmColumnCfg, LlmTableCfg, LlmTabularModel

table_cfg = LlmTableCfg(
    name="Individuals",
    description="A table with information about individuals",
    columns={
        "name": LlmColumnCfg(
            type=Column.STRING,
            description="Name of the individual",
            structure=None,
        ),
        "age": LlmColumnCfg(
            type=Column.INTEGER,
            description="Age of the individual",
            structure=None,
        ),
    }
)

model = LlmTabularModel.load(ckpt_path=Path("path/to/ckpt"))  # Path to ckpt provided by Aindo
model.device = "cuda"

data = model.generate(
    cfg=table_cfg,
    n_samples=10,
    batch_size=1,
    generation_mode=GenerationMode.STRUCTURED,
)
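
The filter-based logic behind the rejection mode can be pictured with a toy loop. This is a conceptual sketch only, not the library's implementation: toy_llm is a made-up stand-in for an unconstrained model, and it is assumed that retry_on_fail counts retries after one initial attempt.

```python
import random
import re
from typing import Callable, Optional


def generate_with_rejection(
    propose: Callable[[], str],
    pattern: str,
    retry_on_fail: int,
) -> Optional[str]:
    """Return the first candidate that fully matches `pattern`.

    One initial attempt is made, followed by at most `retry_on_fail` retries.
    Returns None when every attempt is rejected.
    """
    for _ in range(1 + retry_on_fail):
        candidate = propose()
        if re.fullmatch(pattern, candidate):
            return candidate
    return None


# Toy stand-in for an unconstrained LLM: it sometimes emits a non-numeric age.
rng = random.Random(0)


def toy_llm() -> str:
    return rng.choice(["42", "unknown", "17", "n/a"])


age = generate_with_rejection(toy_llm, pattern=r"\d{1,3}", retry_on_fail=5)
print(age)
```

Every rejected candidate costs a full generation pass, which is why this mode is less efficient when the constraints are rarely met.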

Info

When defining custom regex constraints for LLM output, ensure they align with the expected natural structure of the table. Imposing an overly rigid or unnatural format may lead to incoherent outputs.
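
One way to check whether a regex fits the natural form of a column is to test it against a handful of realistic values before using it as a constraint. The sketch below relies only on Python's re module; the sample names and both patterns are illustrative assumptions.

```python
import re


def match_rate(pattern: str, examples: list[str]) -> float:
    """Fraction of example values that fully match the regex pattern."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = sum(1 for value in examples if re.fullmatch(pattern, value))
    return hits / len(examples)


# Realistic values the column is expected to contain (illustrative only).
names = ["Alice", "Jean-Luc", "O'Neil", "Mary Ann"]

strict = r"[A-Z][a-z]+"         # a single capitalized word
relaxed = r"[A-Z][A-Za-z' -]+"  # also allows hyphens, apostrophes, and spaces

print(match_rate(strict, names))   # 0.25
print(match_rate(relaxed, names))  # 1.0
```

A pattern that rejects most realistic values is likely to push the model toward unnatural output.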

Tip

If the model struggles to generate data that meets the constraints in GenerationMode.STRUCTURED, consider increasing the maximum number of generated tokens per sample (max_tokens) or relaxing the constraints.

Tip

If you notice a high rejection rate in GenerationMode.REJECTION, try relaxing the constraints or increasing the maximum number of retries (retry_on_fail).
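
The effect of retry_on_fail can be estimated with a simple probability model. The sketch below assumes that each attempt is accepted independently with the same probability and that retry_on_fail counts retries after one initial attempt. Both are simplifying assumptions rather than documented behavior.

```python
def success_probability(accept_rate: float, retry_on_fail: int) -> float:
    """Probability that at least one of 1 + retry_on_fail attempts is accepted."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if retry_on_fail < 0:
        raise ValueError("retry_on_fail must be >= 0")
    # All attempts fail with probability (1 - accept_rate) ** (1 + retry_on_fail).
    return 1.0 - (1.0 - accept_rate) ** (1 + retry_on_fail)


for retries in (0, 2, 5):
    print(retries, round(success_probability(0.5, retries), 4))
# 0 0.5
# 2 0.875
# 5 0.9844
```

Under this model, extra retries help quickly when the acceptance rate is moderate, but relaxing the constraints is more effective when it is very low.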

Adding columns

Beyond generating entirely new datasets, the model allows adding synthetic columns to an existing dataset. This is particularly useful for:

Data normalization: Converting inconsistent labels (e.g., "M", "Male", "m") into a standardized format.
Feature enrichment: Creating new columns based on existing data (e.g., adding a "Salary Range" column based on job titles).
Extracting structured data: Converting unstructured text into a structured format (e.g., extracting dates from free-text fields).
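
For comparison, the data normalization case can also be handled with a hand-written lookup when the set of spellings is small and known in advance. The sketch below is such a rule-based baseline, not part of the library; GENDER_ALIASES and normalize_gender are hypothetical names.

```python
from typing import Optional

# Hand-written lookup for known spellings (illustrative, not exhaustive).
GENDER_ALIASES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def normalize_gender(label: str) -> Optional[str]:
    """Map a raw label to "M" or "F"; return None when it is not recognized."""
    return GENDER_ALIASES.get(label.strip().lower())


raw = ["M", "Male", "m", " female ", ""]
print([normalize_gender(value) for value in raw])
# ['M', 'M', 'M', 'F', None]
```

A lookup like this fails on unseen spellings and missing values, which is where an LLM-generated column can add value.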

The LlmTabularModel.add_columns() method integrates new columns into an existing RelationalData object. The required parameters are:

data: A RelationalData object containing the context data to which the new columns are added.
context_cfg: An LlmTableCfg object describing the configuration of the context table data structure.
new_columns: A dictionary mapping the name of each new column to the LlmColumnCfg object with its configuration.

The optional parameters are the same as for the LlmTabularModel.generate() method.

In the following example, we consider a dataset containing names, ages, and gender labels (some of which are inconsistent or missing). We add a "Normalized Gender" column that ensures the values are either "M" or "F".

from pathlib import Path

import pandas as pd
from aindo.rdml.relational import Column, RelationalData, Schema, Table
from aindo.rdml.synth.llm import (
    CategoricalColumnStructure,
    GenerationMode,
    LlmColumnCfg,
    LlmTableCfg,
    LlmTabularModel,
)

df = pd.DataFrame({
    "name": ["Alice", "Bob", "John", "Mary"],
    "age": [25, 30, 35, 40],
    "gender": ["Female", "M", "", "F"],
})
table = Table(
    name=Column.STRING,
    age=Column.INTEGER,
    gender=Column.CATEGORICAL,
)
data = RelationalData(data={'Individuals': df}, schema=Schema(Individuals=table))

model = LlmTabularModel.load(ckpt_path=Path("path/to/ckpt"))  # Path to ckpt provided by Aindo
model.device = "cuda"

new_column = model.add_columns(
    data=data,
    context_cfg=LlmTableCfg(
        name="Individuals",
        description="A table with information about individuals",
        columns={
            "name": LlmColumnCfg(
                type=Column.STRING,
                description="Name of the individual",
            ),
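            "age": LlmColumnCfg(
                type=Column.INTEGER,
                description="Age of the individual",
            ),
            # Describe the raw gender column as well, so that it is part of the context.
            "gender": LlmColumnCfg(
                type=Column.CATEGORICAL,
                description="Gender label as recorded, possibly inconsistent or missing",
            ),
        },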
    ),
    new_columns={
        "Normalized Gender": LlmColumnCfg(
            type=Column.CATEGORICAL,
            description="Gender of the Individual, formatted to be either 'M' or 'F'",
            structure=CategoricalColumnStructure(categories=["M", "F"]),
        ),
    },
    generation_mode=GenerationMode.STRUCTURED,
)