Data generation
The aindo.rdml.synth.llm module provides a Generator class,
which uses an LLM to efficiently generate data conforming to the JSON structure
that can be interpreted as a relational dataset.
This class is used implicitly by the generation method of the task objects
(BaseTask.generate()), but it can also be used on its own.
Its output must be post-processed by the BaseTask.inverse_transform() method
discussed in the task section to obtain the final relational data.
Each Generator instance is based on an Engine,
which refers to a framework for efficient generation from the LLM.
At the moment, the available engines are vLLM, SGLang,
and OpenAI.
To instantiate a Generator object, one can use the
Generator.from_engine() class method, providing:
- engine: The Engine to use for generation.
- model: The model name or path to be used.
The optional parameters are:
- model_dir: The directory where the model converted to float16 for generation is stored.
- unsloth: Whether the model has been fine-tuned with Unsloth.
- cfg_filepath: A path where the generation configuration is saved.
- args: Optional CLI arguments for the generation engine.
- kwargs: Optional keyword arguments for the generation engine.
The Generator.generate() method generates data that conforms
to the target JSON schema, which guarantees that it can be brought back to relational form.
If the guided parameter is set to True (the default), the generation is performed using the structured output
routines of the chosen backend engine, using the required target JSON schema, for each generated sample.
The output should automatically validate against the provided JSON schema. However, there are cases in which the output does not
conform, for example if the maximum number of generated tokens is reached before a valid JSON structure is completed.
In these cases, as well as when the guided parameter is set to False, the invalid output
is discarded and the generation process is repeated until the desired number of valid samples has been generated.
The maximum number of iterations can be set with the retry_on_fail parameter.
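The reject-and-retry behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: `generate_with_retries`, `sample_fn` and `make_flaky_sampler` are hypothetical names, and validity is reduced here to "parses as a JSON object", whereas the real check is made against the target JSON schema.

```python
import json
from collections.abc import Callable


def generate_with_retries(
    sample_fn: Callable[[int], list[str]],  # hypothetical stand-in for one engine call
    n: int,
    retry_on_fail: int,
) -> tuple[list[dict], list[str]]:
    """Collect up to `n` valid JSON objects in at most `retry_on_fail` rounds."""
    valid: list[dict] = []
    rejected: list[str] = []
    for _ in range(retry_on_fail):
        missing = n - len(valid)
        if missing <= 0:
            break
        # Request only the samples that are still missing.
        for raw in sample_fn(missing):
            try:
                obj = json.loads(raw)
            except json.JSONDecodeError:
                rejected.append(raw)  # e.g. output cut off by the token limit
                continue
            if isinstance(obj, dict):
                valid.append(obj)
            else:
                rejected.append(raw)
    return valid, rejected


def make_flaky_sampler() -> Callable[[int], list[str]]:
    """Toy sampler (assumption): every second output is truncated JSON."""
    state = {"calls": 0}

    def sample(k: int) -> list[str]:
        outputs = []
        for _ in range(k):
            state["calls"] += 1
            i = state["calls"]
            outputs.append('{"id": %d}' % i if i % 2 else '{"id": ')
        return outputs

    return sample


valid, rejected = generate_with_retries(make_flaky_sampler(), n=4, retry_on_fail=5)
few_valid, few_rejected = generate_with_retries(make_flaky_sampler(), n=4, retry_on_fail=1)
```

With a generous retry budget the loop reaches the requested number of valid samples. With `retry_on_fail=1` it stops after a single round and returns fewer samples than requested.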
The Generator.generate() method has the following arguments:
- prompt_template: The template for building the generation prompts. As in the fine-tuning routines, it is either a formattable string (whose keys are the fields of RelGenPrompt), or a callable that returns the prompt from a RelGenPrompt object.
- prompts: The RelGenPrompt objects, as returned by BaseTask.prompt(). It can be either a single RelGenPrompt, or an iterable of prompts.
- n: The number of generated samples per prompt. It can be an integer, to be used for all given prompts, or an iterable with an integer for each given prompt.
- guided: Whether to enable the structured output generation of the backend engine.
- rejection_filepath: An optional path to a file where information about the rejected generations is saved.
- retry_on_fail: The maximum number of iterations of the generation process. If, after this number of iterations, fewer valid samples than requested have been generated, the generation is interrupted.
It also accepts variadic keyword arguments, which are passed directly to the sampling parameters of the backend engine.
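To make the two accepted forms of prompt_template and n more concrete, here is a minimal, library-independent sketch. `ToyPrompt` is a hypothetical stand-in for RelGenPrompt (only the `schema` field is assumed, based on the `{schema}` key used in the example template), and `render` and `broadcast_n` are illustrative helpers, not functions of aindo.rdml.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable
from dataclasses import asdict, dataclass


@dataclass
class ToyPrompt:
    """Hypothetical stand-in for `RelGenPrompt`; the `schema` field is an assumption."""

    schema: str


def render(template: str | Callable[[ToyPrompt], str], prompt: ToyPrompt) -> str:
    """Build the prompt text from a formattable string or from a callable."""
    if callable(template):
        return template(prompt)
    # A formattable string uses the prompt fields as keys.
    return template.format(**asdict(prompt))


def broadcast_n(n: int | Iterable[int], n_prompts: int) -> list[int]:
    """Expand `n` to one sample count per prompt."""
    if isinstance(n, int):
        return [n] * n_prompts
    counts = list(n)
    if len(counts) != n_prompts:
        raise ValueError("`n` must provide one integer per prompt")
    return counts


prompt = ToyPrompt(schema="customers(id, name)")
from_string = render("Data schema: {schema}.\nSynthetic data:", prompt)
from_callable = render(lambda p: f"Schema: {p.schema}", prompt)
```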
The Generator.generate() method returns a tuple with two objects:
- A nested list of generated samples in JSON form (Python dictionaries). If the generation succeeded, each element corresponds to one input prompt, and is a list containing the samples generated from it.
- A nested list with the same structure, containing one Invalid object for each rejected sample.
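As a rough illustration of this nested structure, the snippet below works on hand-written toy data shaped like the returned tuple. The placeholder strings stand in for Invalid objects, whose attributes are not described here, and `summarize` is a hypothetical helper.

```python
def summarize(valid: list[list[dict]], rejected: list[list[object]]) -> list[dict]:
    """Count valid and rejected samples for each input prompt."""
    return [
        {"prompt": i, "valid": len(v), "rejected": len(r)}
        for i, (v, r) in enumerate(zip(valid, rejected))
    ]


# Toy output for two prompts: the outer index is the prompt, the inner lists hold its samples.
toy_valid = [[{"a": 1}, {"a": 2}], [{"a": 3}]]
toy_rejected = [["<invalid sample>"], []]  # placeholders for `Invalid` objects

report = summarize(toy_valid, toy_rejected)

# Flatten when the samples of all prompts should be handled together.
all_samples = [sample for per_prompt in toy_valid for sample in per_prompt]
```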
In the following we present an example of using the Generator class
to perform a generation of fully synthetic data.
Since all generated samples use the same prompt, we can build a single prompt and pass the desired
number of samples (n_samples) as the n parameter of the
Generator.generate() method.
The output is a list with a single element, which contains the n_samples generated samples in JSON form.
To obtain the relational data, it is necessary to invoke the
RelSynth.inverse_transform() method
(in this example, with no context).
import pandas as pd
from aindo.rdml.relational import RelationalData
from aindo.rdml.synth.llm import (
    RelSynthPreproc,
    RelSynth,
    Generator,
    Engine,
)
data: RelationalData = ...
preproc = RelSynthPreproc.from_data(data=data)
n_samples = data.n_samples[preproc.root]
task = RelSynth(preproc=preproc)
prompts = task.prompt(n_samples=1) # all samples use the same prompt
prompt = next(iter(prompts))
# Build the `Generator`, and generate the data.
generator = Generator.from_engine(
    engine=Engine.VLLM,
    model="your-favourite-model",  # optionally fine-tuned on one or more tasks
    kwargs={  # optional keyword arguments to initialize the backend engine, these refer to vLLM
        "max_num_seqs": 200,
        "gpu_memory_utilization": 0.75,
    },
)
data_valid, _ = generator.generate(
    prompt_template="Data schema: {schema}.\nSynthetic data:",
    prompts=prompt,
    n=n_samples,  # sample `n_samples`
    max_tokens=32768,  # optional keyword argument, passed to `vllm.SamplingParams`
)
# Inverse transform
data_valid = data_valid[0]  # the `n_samples` generations for the single prompt
data_synth = task.inverse_transform(x=None, y=data_valid, ctx=None, progress=len(data_valid))
Build your own engine (expert user)
Expert users may want to use another generation engine, which is not supported in the
Engine enumeration.
To define a custom engine it is enough to inherit from the BaseEngine
abstract class and implement the BaseEngine.generate abstract method.
The latter takes as parameters:
- prompt_template
- prompts
- n
- guided
They all have the same meaning as their respective counterparts in
Generator.generate().
The BaseEngine.generate() method also supports variadic keyword arguments
that correspond to the variadic keyword arguments of Generator.generate().
Once the custom engine is defined and initialized, the Generator object
can be instantiated directly from it.
from collections.abc import Callable, Iterable
from typing import Any
from aindo.rdml.synth.llm import BaseEngine, Generator, RelGenPrompt
class MyCustomEngine(BaseEngine):
    def __init__(self, arg1, arg2) -> None:
        # Set up the engine here
        self.param1 = arg1
        self.param2 = arg2
        ...

    def generate(
        self,
        prompt_template: str | Callable[[RelGenPrompt], str],
        prompts: RelGenPrompt | Iterable[RelGenPrompt],
        n: int | Iterable[int] = 1,
        guided: bool = True,
        **kwargs: Any,
    ) -> list[list[str]]:
        # Define the generation process here
        ...
        # Return a list of generated JSON objects, with one element per prompt,
        # and with `n` samples for each prompt
        return ...
engine = MyCustomEngine(arg1=..., arg2=...)
generator = Generator(engine=engine)
data_valid, _ = generator.generate(
    prompt_template="Data schema: {schema}.\nSynthetic data:",
    prompts=...,
    n=...,
    # optional keyword arguments, passed to `MyCustomEngine.generate()`, go here
)