module great


class GReaT

GReaT Class

The GReaT class handles the whole generation flow. It is used to fine-tune a large language model on tabular data and to sample synthetic tabular data from it.

Attributes:

  • llm (str): HuggingFace checkpoint of a pretrained large language model, used as the basis of our model
  • tokenizer (AutoTokenizer): Tokenizer, automatically downloaded from the llm checkpoint
  • model (AutoModelForCausalLM): Large language model, automatically downloaded from the llm checkpoint
  • experiment_dir (str): Directory where the training checkpoints will be saved
  • epochs (int): Number of epochs to fine-tune the model
  • batch_size (int): Batch size used for fine-tuning
  • train_hyperparameters (dict): Additional hyperparameters added to the TrainingArguments used by the HuggingFace library; see the HuggingFace TrainingArguments documentation for the full list of possible values
  • columns (list): List of all features/columns of the tabular dataset
  • num_cols (list): List of all numerical features/columns of the tabular dataset
  • conditional_col (str): Name of a feature/column on which the sampling can be conditioned
  • conditional_col_dist (dict | list): Distribution of the feature/column specified by conditional_col
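The two shapes of conditional_col_dist can be sketched in plain Python. This is an illustration of the documented formats, not the library's internal code; the way a starting value is drawn here is an assumption of the sketch.

```python
import random

# Discrete column: a dict mapping each category to its relative frequency.
sex_dist = {"female": 0.6, "male": 0.4}

# Continuous column: a list of observed values to draw from.
age_values = [22, 26, 31, 31, 45]

random.seed(0)

# Draw a start value for a discrete column, weighted by frequency.
start_sex = random.choices(list(sex_dist), weights=sex_dist.values(), k=1)[0]

# Draw a start value for a continuous column, uniformly over observed values.
start_age = random.choice(age_values)

assert start_sex in sex_dist and start_age in age_values
```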

method GReaT.__init__

__init__(
    llm: str,
    experiment_dir: str = 'trainer_great',
    epochs: int = 100,
    batch_size: int = 8,
    **train_kwargs
)

Initializes GReaT.

Args:

  • llm: HuggingFace checkpoint of a pretrained large language model, used as the basis for our model
  • experiment_dir: Directory where the training checkpoints will be saved
  • epochs: Number of epochs to fine-tune the model
  • batch_size: Batch size used for fine-tuning
  • train_kwargs: Additional hyperparameters added to the TrainingArguments used by the HuggingFace library; see the HuggingFace TrainingArguments documentation for the full list of possible values
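The forwarding of **train_kwargs can be sketched as a plain keyword-merge. The helper below is hypothetical (not the library's code), but the keys shown (output_dir, num_train_epochs, per_device_train_batch_size, learning_rate) are real TrainingArguments fields.

```python
# Hypothetical sketch: how extra keyword arguments could be merged into the
# training configuration that GReaT hands on to TrainingArguments.
DEFAULTS = {"num_train_epochs": 100, "per_device_train_batch_size": 8}

def build_training_config(experiment_dir, epochs, batch_size, **train_kwargs):
    config = dict(DEFAULTS)
    config.update(
        output_dir=experiment_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
    )
    # Any extra keyword argument overrides or extends the defaults.
    config.update(train_kwargs)
    return config

cfg = build_training_config("trainer_great", 50, 16, learning_rate=5e-5)
assert cfg["learning_rate"] == 5e-5 and cfg["num_train_epochs"] == 50
```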

method GReaT.fit

fit(
    data: Union[pandas.core.frame.DataFrame, numpy.ndarray],
    column_names: Optional[List[str]] = None,
    conditional_col: Optional[str] = None,
    resume_from_checkpoint: Union[bool, str] = False
) → GReaTTrainer

Fine-tune GReaT using tabular data.

Args:

  • data: Pandas DataFrame or Numpy Array that contains the tabular data
  • column_names: If data is a Numpy Array, the feature names have to be defined. If data is a Pandas DataFrame, this value is ignored
  • conditional_col: If given, the distribution of this column is saved and used as a starting point for the generation process later. If None, the last column is considered as the conditional feature
  • resume_from_checkpoint: If True, resumes training from the latest checkpoint in the experiment_dir. If a path is given, resumes the training from that checkpoint (has to be a valid HuggingFace checkpoint!)

Returns: GReaTTrainer used for the fine-tuning process
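During fine-tuning, each table row is serialized into a sentence of "column is value" pairs, the same format shown for starting_prompts below ("Sex is female, Age is 26"). A minimal sketch of that encoding, with the optional column shuffling being an assumption of this sketch rather than documented behavior:

```python
import random

def encode_row(row: dict, shuffle: bool = False) -> str:
    # Serialize one table row as "col is value" pairs joined by commas.
    pairs = [f"{col} is {val}" for col, val in row.items()]
    if shuffle:
        # Hypothetical: randomize feature order between training sentences.
        random.shuffle(pairs)
    return ", ".join(pairs)

print(encode_row({"Sex": "female", "Age": 26}))  # Sex is female, Age is 26
```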


method GReaT.great_sample

great_sample(
    starting_prompts: Union[str, list[str]],
    temperature: float = 0.7,
    max_length: int = 100,
    device: str = 'cuda'
) → DataFrame

Generate synthetic tabular data samples conditioned on a given input.

Args:

  • starting_prompts: String or List of Strings on which the output is conditioned. For example, "Sex is female, Age is 26"
  • temperature: Generation samples each token from the probability distribution given by a softmax over the logits; the temperature parameter scales the logits before the softmax. A low temperature makes the distribution sharper (0 corresponds to greedy search), while a high temperature brings more diversity, but also more uncertainty, into the output
  • max_length: Maximum number of tokens to generate; has to be long enough so that no information is cut off
  • device: Set to "cpu" if the GPU should not be used. A specific GPU can also be given, e.g. "cuda:0"

Returns: Pandas DataFrame with synthetic data generated based on starting_prompts
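The effect of temperature is generic softmax math, not anything specific to this library; a minimal sketch:

```python
import math

def softmax(logits, temperature):
    # Dividing logits by the temperature sharpens (T < 1) or flattens (T > 1)
    # the resulting distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax(logits, 0.1)  # near one-hot: almost all mass on the top token
warm = softmax(logits, 2.0)  # much flatter: more diversity, more uncertainty
assert cold[0] > warm[0]
```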


method GReaT.load_finetuned_model

load_finetuned_model(path: str)

Load fine-tuned model

Load the weights of a fine-tuned large language model into the GReaT pipeline

Args:

  • path: Path to the fine-tuned model

classmethod GReaT.load_from_dir

load_from_dir(path: str)

Load GReaT class

Load trained GReaT model from directory.

Args:

  • path: Directory where GReaT model is saved

Returns: New instance of GReaT loaded from directory


method GReaT.sample

sample(
    n_samples: int,
    start_col: Optional[str] = '',
    start_col_dist: Optional[Union[dict, list]] = None,
    temperature: float = 0.7,
    k: int = 100,
    max_length: int = 100,
    device: str = 'cuda'
) → DataFrame

Generate synthetic tabular data samples

Args:

  • n_samples: Number of synthetic samples to generate
  • start_col: Feature to use as the starting point of the generation process. If not given, the conditional feature learned during fitting is used
  • start_col_dist: Feature distribution of the starting feature. Should have the format "{F1: p1, F2: p2, ...}" for discrete columns or be a list of possible values for continuous columns. If not given, the distribution learned during fitting is used
  • temperature: Generation samples each token from the probability distribution given by a softmax over the logits; the temperature parameter scales the logits before the softmax. A low temperature makes the distribution sharper (0 corresponds to greedy search), while a high temperature brings more diversity, but also more uncertainty, into the output
  • k: Sampling batch size. Higher values speed up the generation process significantly; set it as high as the available memory allows
  • max_length: Maximum number of tokens to generate; has to be long enough so that no information is cut off!
  • device: Set to "cpu" if the GPU should not be used. A specific GPU can also be given, e.g. "cuda:0"

Returns: Pandas DataFrame with n_samples rows of generated data
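Since generated samples come back as "column is value" text before they become DataFrame rows, the decoding step can be sketched as the inverse of the starting_prompts format. The helper below is hypothetical; the library's actual parsing is more robust.

```python
def decode_row(text: str) -> dict:
    # Invert the "col is value" encoding: split on pair separators, then on
    # the first " is " inside each pair.
    row = {}
    for pair in text.split(", "):
        col, _, val = pair.partition(" is ")
        row[col] = val
    return row

assert decode_row("Sex is female, Age is 26") == {"Sex": "female", "Age": "26"}
```

Note that every value comes back as a string; turning "26" back into a number is a separate post-processing step.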


method GReaT.save

save(path: str)

Save GReaT Model

Saves the model weights and a configuration file in the given directory.

Args:

  • path: Path where to save the model
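The save/load round trip described above amounts to persisting the model weights plus a small configuration file. A hedged sketch of the configuration part only, using attribute names from the documented attribute list (the actual file layout is the library's own):

```python
import json
import os
import tempfile

# Attributes needed to rebuild the instance, mirroring the documented list.
config = {
    "columns": ["Age", "Sex"],
    "num_cols": ["Age"],
    "conditional_col": "Sex",
    "conditional_col_dist": {"female": 0.6, "male": 0.4},
}

path = tempfile.mkdtemp()

# save(path): write the configuration next to the model weights.
with open(os.path.join(path, "config.json"), "w") as f:
    json.dump(config, f)

# load_from_dir(path): read it back and restore the attributes.
with open(os.path.join(path, "config.json")) as f:
    restored = json.load(f)

assert restored == config
```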

This file was automatically generated via lazydocs.