module great


class GReaT

GReaT Class

The GReaT class handles the whole generation flow. It is used to fine-tune a large language model on tabular data and to sample synthetic tabular data from it.

Attributes:

  • llm (str): HuggingFace checkpoint of a pretrained large language model, used as the basis of our model
  • tokenizer (AutoTokenizer): Tokenizer, automatically downloaded from the llm checkpoint
  • model (AutoModelForCausalLM): Large language model, automatically downloaded from the llm checkpoint
  • experiment_dir (str): Directory where the training checkpoints will be saved
  • epochs (int): Number of epochs to fine-tune the model
  • batch_size (int): Batch size used for fine-tuning
  • train_hyperparameters (dict): Additional hyperparameters added to the TrainingArguments used by the HuggingFace library; see the HuggingFace TrainingArguments documentation for the full list of possible values
  • columns (list): List of all features/columns of the tabular dataset
  • num_cols (list): List of all numerical features/columns of the tabular dataset
  • conditional_col (str): Name of a feature/column on which the sampling can be conditioned
  • conditional_col_dist (dict | list): Distribution of the feature/column specified by conditional_col
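The two shapes of conditional_col_dist can be sketched in plain Python. This is an illustration of the documented formats, not the library's internal code; the way a starting value is drawn here is an assumption of the sketch.

```python
import random

# Discrete column: a dict mapping each category to its relative frequency.
sex_dist = {"female": 0.6, "male": 0.4}

# Continuous column: a list of observed values to draw from.
age_values = [22, 26, 31, 31, 45]

random.seed(0)

# Draw a start value for a discrete column, weighted by frequency.
start_sex = random.choices(list(sex_dist), weights=sex_dist.values(), k=1)[0]

# Draw a start value for a continuous column, uniformly over observed values.
start_age = random.choice(age_values)

assert start_sex in sex_dist and start_age in age_values
```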

method GReaT.__init__

__init__(
    llm: str,
    experiment_dir: str = 'trainer_great',
    epochs: int = 100,
    batch_size: int = 8,
    **train_kwargs
)

Initializes GReaT.

Args:

  • llm: HuggingFace checkpoint of a pretrained large language model, used as the basis for our model
  • experiment_dir: Directory where the training checkpoints will be saved
  • epochs: Number of epochs to fine-tune the model
  • batch_size: Batch size used for fine-tuning
  • train_kwargs: Additional hyperparameters added to the TrainingArguments used by the HuggingFace library; see the HuggingFace TrainingArguments documentation for the full list of possible values
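The forwarding of **train_kwargs can be sketched as a plain keyword-merge. The helper below is hypothetical (not the library's code), but the keys shown (output_dir, num_train_epochs, per_device_train_batch_size, learning_rate) are real TrainingArguments fields.

```python
# Hypothetical sketch: how extra keyword arguments could be merged into the
# training configuration that GReaT hands on to TrainingArguments.
DEFAULTS = {"num_train_epochs": 100, "per_device_train_batch_size": 8}

def build_training_config(experiment_dir, epochs, batch_size, **train_kwargs):
    config = dict(DEFAULTS)
    config.update(
        output_dir=experiment_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
    )
    # Any extra keyword argument overrides or extends the defaults.
    config.update(train_kwargs)
    return config

cfg = build_training_config("trainer_great", 50, 16, learning_rate=5e-5)
assert cfg["learning_rate"] == 5e-5 and cfg["num_train_epochs"] == 50
```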

method GReaT.fit

fit(
    data: Union[pandas.core.frame.DataFrame, numpy.ndarray],
    column_names: Optional[List[str]] = None,
    conditional_col: Optional[str] = None,
    resume_from_checkpoint: Union[bool, str] = False
) → GReaTTrainer

Fine-tune GReaT using tabular data.

Args:

  • data: Pandas DataFrame or Numpy Array that contains the tabular data
  • column_names: If data is a Numpy Array, the feature names have to be defined. If data is a Pandas DataFrame, this value is ignored
  • conditional_col: If given, the distribution of this column is saved and used as a starting point for the generation process later. If None, the last column is considered as the conditional feature
  • resume_from_checkpoint: If True, resumes training from the latest checkpoint in the experiment_dir. If a path is given, resumes the training from that checkpoint (has to be a valid HuggingFace checkpoint!)

Returns: GReaTTrainer used for the fine-tuning process
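During fine-tuning, each table row is serialized into a sentence of "column is value" pairs, the same format shown for starting_prompts below ("Sex is female, Age is 26"). A minimal sketch of that encoding, with the optional column shuffling being an assumption of this sketch rather than documented behavior:

```python
import random

def encode_row(row: dict, shuffle: bool = False) -> str:
    # Serialize one table row as "col is value" pairs joined by commas.
    pairs = [f"{col} is {val}" for col, val in row.items()]
    if shuffle:
        # Hypothetical: randomize feature order between training sentences.
        random.shuffle(pairs)
    return ", ".join(pairs)

print(encode_row({"Sex": "female", "Age": 26}))  # Sex is female, Age is 26
```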


method GReaT.great_sample

great_sample(
    starting_prompts: Union[str, list[str]],
    temperature: float = 0.7,
    max_length: int = 100,
    device: str = 'cuda'
) → DataFrame

Generate synthetic tabular data samples conditioned on a given input.

Args:

  • starting_prompts: String or List of Strings on which the output is conditioned. For example, "Sex is female, Age is 26"
  • temperature: Generation samples each token from the probability distribution given by a softmax over the logits; the temperature parameter scales the logits before the softmax. A low temperature makes the distribution sharper (0 corresponds to greedy search), while a high temperature brings more diversity, but also more uncertainty, into the output
  • max_length: Maximum number of tokens to generate; has to be long enough so that no information is cut off
  • device: Set to "cpu" if the GPU should not be used. A specific GPU can also be given, e.g. "cuda:0"

Returns: Pandas DataFrame with synthetic data generated based on starting_prompts
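The effect of temperature is generic softmax math, not anything specific to this library; a minimal sketch:

```python
import math

def softmax(logits, temperature):
    # Dividing logits by the temperature sharpens (T < 1) or flattens (T > 1)
    # the resulting distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax(logits, 0.1)  # near one-hot: almost all mass on the top token
warm = softmax(logits, 2.0)  # much flatter: more diversity, more uncertainty
assert cold[0] > warm[0]
```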


method GReaT.load_finetuned_model

load_finetuned_model(path: str)

Load fine-tuned model

Load the weights of a fine-tuned large language model into the GReaT pipeline

Args:

  • path: Path to the fine-tuned model

classmethod GReaT.load_from_dir

load_from_dir(path: str)

Load GReaT class

Load trained GReaT model from directory.

Args:

  • path: Directory where GReaT model is saved

Returns: New instance of GReaT loaded from directory


method GReaT.sample

sample(
    n_samples: int,
    start_col: Optional[str] = '',
    start_col_dist: Optional[Union[dict, list]] = None,
    temperature: float = 0.7,
    k: int = 100,
    max_length: int = 100,
    device: str = 'cuda'
) → DataFrame

Generate synthetic tabular data samples

Args:

  • n_samples: Number of synthetic samples to generate
  • start_col: Feature to use as the starting point of the generation process. If not given, the conditional feature learned during fitting is used
  • start_col_dist: Feature distribution of the starting feature. Should have the format "{F1: p1, F2: p2, ...}" for discrete columns or be a list of possible values for continuous columns. If not given, the distribution learned during fitting is used
  • temperature: Generation samples each token from the probability distribution given by a softmax over the logits; the temperature parameter scales the logits before the softmax. A low temperature makes the distribution sharper (0 corresponds to greedy search), while a high temperature brings more diversity, but also more uncertainty, into the output
  • k: Sampling batch size. Higher values speed up the generation process significantly; set it as high as the available memory allows
  • max_length: Maximum number of tokens to generate; has to be long enough so that no information is cut off!
  • device: Set to "cpu" if the GPU should not be used. A specific GPU can also be given, e.g. "cuda:0"

Returns: Pandas DataFrame with n_samples rows of generated data
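Since generated samples come back as "column is value" text before they become DataFrame rows, the decoding step can be sketched as the inverse of the starting_prompts format. The helper below is hypothetical; the library's actual parsing is more robust.

```python
def decode_row(text: str) -> dict:
    # Invert the "col is value" encoding: split on pair separators, then on
    # the first " is " inside each pair.
    row = {}
    for pair in text.split(", "):
        col, _, val = pair.partition(" is ")
        row[col] = val
    return row

assert decode_row("Sex is female, Age is 26") == {"Sex": "female", "Age": "26"}
```

Note that every value comes back as a string; turning "26" back into a number is a separate post-processing step.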


method GReaT.save

save(path: str)

Save GReaT Model

Saves the model weights and a configuration file in the given directory.

Args:

  • path: Path where to save the model
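The save/load round trip described above amounts to persisting the model weights plus a small configuration file. A hedged sketch of the configuration part only, using attribute names from the documented attribute list (the actual file layout is the library's own):

```python
import json
import os
import tempfile

# Attributes needed to rebuild the instance, mirroring the documented list.
config = {
    "columns": ["Age", "Sex"],
    "num_cols": ["Age"],
    "conditional_col": "Sex",
    "conditional_col_dist": {"female": 0.6, "male": 0.4},
}

path = tempfile.mkdtemp()

# save(path): write the configuration next to the model weights.
with open(os.path.join(path, "config.json"), "w") as f:
    json.dump(config, f)

# load_from_dir(path): read it back and restore the attributes.
with open(os.path.join(path, "config.json")) as f:
    restored = json.load(f)

assert restored == config
```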

This file was automatically generated via lazydocs.