LLM-based models
The aindo.rdml.synth.llm module leverages the capabilities of Large Language Models (LLMs)
to generate synthetic data.
With LLMs, the user can perform the same tasks as with the standard neural models,
while also taking advantage of the power of LLMs and the prior knowledge they acquired during pretraining, which is usually performed on huge amounts of data.
With the aindo.rdml.synth.llm module, a pretrained LLM can be fine-tuned
on one or more downstream tasks of synthetic data generation, and on one or more datasets,
allowing the model to learn better representations of relational tabular data.
LLMs can be particularly helpful when the original dataset is relatively small (down to a few hundred records), since they are very expressive and can also learn from relatively few examples. Moreover, if the model was pretrained on some other similar dataset, the knowledge coming from this pretraining may help generate even better synthetic data.
At its core, the aindo.rdml.synth.llm module provides a set of tools to preprocess
relational datasets into objects that can be used to fine-tune any LLM.
Then the user may decide to fine-tune an LLM of their choice, use it to generate the desired data,
and finally rely on the post-processing utilities of aindo.rdml.synth.llm
to bring the generated data back to the same initial relational data form.
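The pre- and post-processing step can be pictured as a round trip between tabular records and text: each record is serialized into a textual form an LLM can be fine-tuned on, and generated text is parsed back into tabular form. The sketch below illustrates this idea in plain Python; the function names and the serialization format are illustrative assumptions, not the actual aindo.rdml.synth.llm API.

```python
# Hypothetical sketch of LLM pre/post-processing for tabular data.
# The real aindo.rdml.synth.llm utilities are not shown here; the names
# record_to_text / text_to_record and the "col is value" format are
# illustrative assumptions only.

def record_to_text(record: dict) -> str:
    """Serialize one record as 'column is value' clauses joined by '; '."""
    return "; ".join(f"{k} is {v}" for k, v in record.items())

def text_to_record(text: str) -> dict:
    """Parse generated text back into a record (inverse of record_to_text)."""
    record = {}
    for clause in text.split("; "):
        key, _, value = clause.partition(" is ")
        record[key] = value
    return record

row = {"age": "34", "city": "Trieste", "income": "52000"}
encoded = record_to_text(row)    # text the LLM is fine-tuned on
decoded = text_to_record(encoded)  # generated text mapped back to a record
assert decoded == row
```

The key property is that the parsing step inverts the serialization step, so synthetic text produced by the fine-tuned LLM can be mapped back to the same relational schema as the original data.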
Although its main value lies in the pre- and post-processing functionalities,
the aindo.rdml.synth.llm module also offers additional tools to:
- Build a training dataset, including different combinations of relational datasets and tasks.
- Fine-tune an LLM of choice on the created training dataset.
- Use a (potentially fine-tuned) LLM to generate synthetic data using efficient inference frameworks.
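For relational data, building the training dataset also means preserving the parent-child structure across tables. One common approach, sketched below under stated assumptions, is to flatten each parent row together with its related child rows into a single training document. The table names, column names, and the `build_documents` helper are hypothetical illustrations, not part of the aindo.rdml.synth.llm API.

```python
# Hypothetical sketch: flattening a two-table relational dataset
# (patients with one-to-many visits) into one training document per
# parent row, so an LLM can learn the cross-table structure.
# All names here are illustrative assumptions.

patients = [{"id": 1, "sex": "F"}, {"id": 2, "sex": "M"}]
visits = [
    {"patient_id": 1, "date": "2023-01-05", "dept": "cardiology"},
    {"patient_id": 1, "date": "2023-03-12", "dept": "radiology"},
    {"patient_id": 2, "date": "2023-02-20", "dept": "oncology"},
]

def build_documents(parents, children, fk="patient_id"):
    """Group each parent row with its child rows into one text document."""
    docs = []
    for p in parents:
        lines = ["patient: " + ", ".join(
            f"{k}={v}" for k, v in p.items() if k != "id")]
        for c in children:
            if c[fk] == p["id"]:
                lines.append("  visit: " + ", ".join(
                    f"{k}={v}" for k, v in c.items() if k != fk))
        docs.append("\n".join(lines))
    return docs

corpus = build_documents(patients, visits)
# corpus[0] groups patient 1 with both of its visits in a single document
```

Keeping each parent and its children in one document lets the fine-tuned model learn the one-to-many relationship, which can then be recovered by the post-processing step when generated documents are parsed back into separate tables.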