Generation
LLMs can generate data in a variety of contexts.
The LlmTabularModel provides two methods to generate synthetic data from scratch, in two common scenarios:
- Given a table schema, the LlmTabularModel.generate() method generates novel synthetic data following the provided schema.
- Given an existing dataset, the LlmTabularModel.add_columns() method generates new synthetic columns that can be integrated into the dataset.
Data generation
To generate synthetic data that adheres to a specific table schema, use the LlmTabularModel.generate() method.
The required parameters are:
- cfg: The configuration of the data structure to generate.
- n_samples: The number of records to generate.
There are also some optional parameters:
- batch_size: The number of samples generated per batch. Adjust it based on available memory and performance requirements: reduce it under memory constraints, or increase it to speed up generation.
- max_tokens: The maximum number of tokens to generate for each sample.
- generation_mode: The generation mode for the LLM. It can be a GenerationMode, or its string representation (structured, rejection).
- retry_on_fail: The number of retry attempts when a sample is rejected because it does not meet the specified constraints.
- temp: The temperature parameter for sampling.
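To give a sense of what the temp parameter controls, here is a minimal, library-independent sketch of temperature-scaled sampling probabilities. It illustrates the general technique only and is not taken from the aindo.rdml implementation; the function name and the example scores are illustrative assumptions.

```python
import math


def softmax_with_temperature(logits: list[float], temp: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by `temp`."""
    if temp <= 0:
        raise ValueError("temp must be positive")
    scaled = [x / temp for x in logits]
    # Subtract the maximum before exponentiating, for numerical stability.
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, temp=0.5)
flat = softmax_with_temperature(logits, temp=2.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

Lower temperatures concentrate the probability on the top-scoring candidate, which tends to give more predictable values. Higher temperatures spread the probability out, which gives more varied values.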
Performance considerations
Generating structured data with an LLM is computationally expensive. The performance depends on several factors, including:
- Number of columns: More columns increase complexity and memory usage.
- Complexity of constraints: Strict constraints (e.g., regex-based structures) may increase rejection rates.
- Data length: Longer text-based columns require more tokens and increase processing time.
- Batch size: Larger batch sizes speed up generation but demand more memory.
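The batch size trade-off can be made concrete with a small sketch of how a request for n_samples records splits into batches. How the library schedules batches internally is not described here, so the helper below (plan_batches, a hypothetical name) is only an illustration of the arithmetic.

```python
def plan_batches(n_samples: int, batch_size: int) -> list[int]:
    """Return the size of each batch needed to produce `n_samples` records."""
    if n_samples < 0 or batch_size < 1:
        raise ValueError("n_samples must be >= 0 and batch_size >= 1")
    n_full, remainder = divmod(n_samples, batch_size)
    # Full batches first, then one smaller batch for any leftover records.
    return [batch_size] * n_full + ([remainder] if remainder else [])


print(plan_batches(10, 1))  # ten batches of one record each
print(plan_batches(10, 4))  # [4, 4, 2]
```

Fewer, larger batches mean fewer passes through the model, but each pass holds more sequences in memory at once.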
Since LLM inference is resource-intensive, using a GPU is highly recommended to speed up the process. To specify the computation device, set the model’s device attribute:
from aindo.rdml.synth.llm import LlmTabularModel
model = LlmTabularModel.load(ckpt_path=...)
model.device = "cuda"
Generation modes
The module supports two approaches to data generation:
- GenerationMode.STRUCTURED (guided generation): The LLM is guided to produce structured data that adheres to the predefined structure. This approach is efficient but may cause inconsistencies if the model struggles to follow the constraints (e.g., generating numbers instead of names).
- GenerationMode.REJECTION (filter-based generation): The model generates data freely, and samples that do not meet the constraints are discarded and regenerated. This method is more reliable but less efficient, since rejected samples require additional processing. The retry_on_fail parameter controls how many times the model attempts to regenerate valid data.
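The difference between the two modes can be illustrated with a toy example. The sketch below is a deliberate simplification: real guided generation constrains the model token by token, whereas here the constraint is applied to whole candidate values. The candidate values, their scores, and the function names are all hypothetical and do not reflect the library's internals.

```python
import re

# Hypothetical candidate outputs for an "age" value, with model scores
# (higher means the model considers the candidate more likely).
CANDIDATES = {"thirty": 0.5, "30": 0.3, "n/a": 0.2}
INTEGER = re.compile(r"\d+")


def guided_pick(candidates: dict[str, float], pattern: re.Pattern) -> str:
    """Guided style: only candidates that satisfy the constraint are eligible."""
    allowed = {c: s for c, s in candidates.items() if pattern.fullmatch(c)}
    if not allowed:
        raise ValueError("no candidate satisfies the constraint")
    return max(allowed, key=allowed.get)


def free_pick(candidates: dict[str, float]) -> str:
    """Free generation: the best-scoring candidate wins, valid or not."""
    return max(candidates, key=candidates.get)


guided = guided_pick(CANDIDATES, INTEGER)
free = free_pick(CANDIDATES)
print("guided:", guided)
print("free:", free, "| valid:", bool(INTEGER.fullmatch(free)))
```

In the guided case the constraint shapes the choice itself, so a valid value is produced in one pass. In the free case the top choice is checked afterwards, and an invalid value would have to be discarded and regenerated.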
from aindo.rdml.relational import Column
from aindo.rdml.synth.llm import LlmTabularModel
from aindo.rdml.synth.llm import LlmColumnCfg, LlmTableCfg, GenerationMode
from pathlib import Path
table_cfg = LlmTableCfg(
    name="Individuals",
    description="A table with information about individuals",
    columns={
        "name": LlmColumnCfg(
            type=Column.STRING,
            description="Name of the individual",
            structure=None,
        ),
        "age": LlmColumnCfg(
            type=Column.INTEGER,
            description="Age of the individual",
            structure=None,
        ),
    },
)
model = LlmTabularModel.load(ckpt_path=Path("path/to/ckpt"))
model.device = "cuda"
data = model.generate(
    cfg=table_cfg,
    n_samples=10,
    batch_size=1,
    generation_mode=GenerationMode.STRUCTURED,
)
Note: When defining custom regex constraints for LLM output, ensure they align with the expected natural structure of the table. Imposing an overly rigid or unnatural format may lead to incoherent outputs.
Tip: If the model struggles to generate data that meets the constraints in GenerationMode.STRUCTURED, consider increasing the maximum number of generated tokens per sample (max_tokens) or relaxing the constraints.
Tip: If you notice a high rejection rate in GenerationMode.REJECTION, try relaxing the constraints or increasing the maximum number of retries (retry_on_fail).
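To see how the retry budget interacts with the rejection rate, consider the following minimal sketch of a rejection loop. It is not the library's implementation: generate_with_retries and toy_model are hypothetical names, the "model" is a random choice from a fixed pool, and the sketch assumes that retry_on_fail counts the extra attempts made after the first one.

```python
import random
import re
from typing import Callable, Optional


def generate_with_retries(
    sample_fn: Callable[[], str],
    pattern: re.Pattern,
    retry_on_fail: int,
) -> Optional[str]:
    """Draw from `sample_fn` until a value matches `pattern` or retries run out."""
    # Assumption: one initial attempt plus up to `retry_on_fail` retries.
    for _ in range(1 + retry_on_fail):
        value = sample_fn()
        if pattern.fullmatch(value):
            return value
    return None  # every attempt was rejected


rng = random.Random(0)
POOL = ["42", "forty-two", "7", "unknown"]  # hypothetical raw model outputs


def toy_model() -> str:
    """Stand-in for an unconstrained model: half of its outputs are valid."""
    return rng.choice(POOL)


integer = re.compile(r"\d+")
results = [
    generate_with_retries(toy_model, integer, retry_on_fail=3) for _ in range(100)
]
n_failed = sum(r is None for r in results)
print(f"{n_failed} of {len(results)} samples exhausted their retries")
```

If each attempt is accepted independently with probability p, a sample exhausts its budget with probability (1 - p)^(1 + retry_on_fail). In this toy setup p = 0.5, so with three retries about 0.5^4 = 6.25% of samples are expected to fail. Raising retry_on_fail or relaxing the constraint (which raises p) both reduce that figure.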
Adding columns
Beyond generating entirely new datasets, the model allows adding synthetic columns to an existing dataset. This is particularly useful for:
- Data normalization: Converting inconsistent labels (e.g., "M", "Male", "m") into a standardized format.
- Feature enrichment: Creating new columns based on existing data (e.g., adding a "Salary Range" column based on job titles).
- Extracting structured data: Converting unstructured text into a structured format (e.g., extracting dates from free-text fields).
The LlmTabularModel.add_columns() method integrates new columns into an existing RelationalData object.
The required parameters are:
- data: A RelationalData object with the context data to which the new columns are added.
- context_cfg: An LlmTableCfg object describing the configuration of the context table data structure.
- new_columns: A dictionary of LlmColumnCfg objects containing the configurations of the new columns to add.
The optional parameters are the same as for the LlmTabularModel.generate() method.
In the following example, we consider a dataset containing names, ages, and gender labels (some of which are inconsistent or missing). We add a "Normalized Gender" column that ensures the values are either "M" or "F".
import pandas as pd
from aindo.rdml.relational import Column, RelationalData, Schema, Table
from aindo.rdml.synth.llm import CategoricalColumnStructure, LlmColumnCfg, LlmTableCfg, LlmTabularModel, GenerationMode
from pathlib import Path
df = pd.DataFrame({
    "name": ["Alice", "Bob", "John", "Mary"],
    "age": [25, 30, 35, 40],
    "gender": ["Female", "M", "", "F"],
})
table = Table(
    name=Column.STRING,
    age=Column.INTEGER,
    gender=Column.CATEGORICAL,
)
data = RelationalData(data={'Individuals': df}, schema=Schema(Individuals=table))
model = LlmTabularModel.load(ckpt_path=Path("path/to/ckpt")) # Path to ckpt provided by Aindo
model.device = "cuda"
new_column = model.add_columns(
    data=data,
    context_cfg=LlmTableCfg(
        name="Individuals",
        description="A table with information about individuals",
        columns={
            "name": LlmColumnCfg(
                type=Column.STRING,
                description="Name of the individual",
            ),
            "age": LlmColumnCfg(
                type=Column.INTEGER,
                description="Age of the individual",
            ),
        },
    ),
    new_columns={
        "Normalized Gender": LlmColumnCfg(
            type=Column.CATEGORICAL,
            description="Gender of the Individual, formatted to be either 'M' or 'F'",
            structure=CategoricalColumnStructure(categories=["M", "F"]),
        ),
    },
    generation_mode=GenerationMode.STRUCTURED,
)
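After adding a categorical column, it can be worth checking that every generated value falls within the allowed categories. The sketch below works on a plain Python list and makes no assumption about the type returned by add_columns(), which this section does not specify. The helper name summarize_categories and the sample values are hypothetical.

```python
from collections import Counter

ALLOWED = {"M", "F"}


def summarize_categories(
    values: list[str], allowed: set[str]
) -> tuple[Counter, dict[str, int]]:
    """Count each value and collect the ones outside the allowed categories."""
    counts = Counter(values)
    unexpected = {v: n for v, n in counts.items() if v not in allowed}
    return counts, unexpected


# Hypothetical values standing in for the generated "Normalized Gender" column.
values = ["F", "M", "M", "F"]
counts, unexpected = summarize_categories(values, ALLOWED)
print(dict(counts))
print("unexpected values:", unexpected)
```

An empty dictionary of unexpected values indicates that the categorical structure was respected. Any entries there point to rows that may need a second pass or looser constraints.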