# module `great`

## class `GReaT`

GReaT Class

The GReaT class handles the whole generation flow. It is used to fine-tune a large language model on tabular data and to sample synthetic tabular data from it.
**Attributes:**

- `llm` (str): HuggingFace checkpoint of a pretrained large language model, used as the basis of our model
- `tokenizer` (AutoTokenizer): Tokenizer, automatically downloaded from the llm checkpoint
- `model` (AutoModelForCausalLM): Large language model, automatically downloaded from the llm checkpoint
- `experiment_dir` (str): Directory where the training checkpoints will be saved
- `epochs` (int): Number of epochs to fine-tune the model
- `batch_size` (int): Batch size used for fine-tuning
- `train_hyperparameters` (dict): Additional hyperparameters passed to the `TrainingArguments` used by the HuggingFace library; see the HuggingFace documentation for the full list of possible values
- `columns` (list): List of all features/columns of the tabular dataset
- `num_cols` (list): List of all numerical features/columns of the tabular dataset
- `conditional_col` (str): Name of a feature/column on which the sampling can be conditioned
- `conditional_col_dist` (dict | list): Distribution of the feature/column specified by conditional_col
### method `GReaT.__init__`

```python
__init__(
    llm: str,
    experiment_dir: str = 'trainer_great',
    epochs: int = 100,
    batch_size: int = 8,
    **train_kwargs
)
```

Initializes GReaT.

**Args:**

- `llm`: HuggingFace checkpoint of a pretrained large language model, used as the basis for our model
- `experiment_dir`: Directory where the training checkpoints will be saved
- `epochs`: Number of epochs to fine-tune the model
- `batch_size`: Batch size used for fine-tuning
- `train_kwargs`: Additional hyperparameters passed to the `TrainingArguments` used by the HuggingFace library; see the HuggingFace documentation for the full list of possible values
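The `**train_kwargs` are forwarded into the HuggingFace `TrainingArguments`. A minimal sketch of how such keyword arguments could be merged with the defaults set in `__init__` (the helper `build_training_args` and the exact key names are illustrative assumptions, not part of the GReaT API):

```python
def build_training_args(experiment_dir: str, epochs: int,
                        batch_size: int, **train_kwargs) -> dict:
    """Illustrative only: merge GReaT's constructor defaults
    with user-supplied TrainingArguments overrides."""
    args = {
        "output_dir": experiment_dir,              # where checkpoints are saved
        "num_train_epochs": epochs,                # fine-tuning epochs
        "per_device_train_batch_size": batch_size, # fine-tuning batch size
    }
    args.update(train_kwargs)  # e.g. learning_rate=5e-5, save_steps=500
    return args

config = build_training_args("trainer_great", 100, 8, learning_rate=5e-5)
```

Any keyword the `TrainingArguments` class accepts can be passed this way, e.g. `GReaT("distilgpt2", learning_rate=5e-5)`.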
### method `GReaT.fit`

```python
fit(
    data: Union[pandas.core.frame.DataFrame, numpy.ndarray],
    column_names: Optional[List[str]] = None,
    conditional_col: Optional[str] = None,
    resume_from_checkpoint: Union[bool, str] = False
) → GReaTTrainer
```

Fine-tune GReaT using tabular data.

**Args:**

- `data`: Pandas DataFrame or NumPy array that contains the tabular data
- `column_names`: If data is a NumPy array, the feature names have to be defined here. If data is a Pandas DataFrame, this value is ignored
- `conditional_col`: If given, the distribution of this column is saved and later used as a starting point for the generation process. If None, the last column is used as the conditional feature
- `resume_from_checkpoint`: If True, resumes training from the latest checkpoint in experiment_dir. If a path is given, resumes training from that checkpoint (has to be a valid HuggingFace checkpoint!)

**Returns:** GReaTTrainer used for the fine-tuning process
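Since a NumPy array carries no feature names, `column_names` must supply them. A small sketch of the equivalent labelling, using pandas directly (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# raw tabular data without feature names
arr = np.array([[26, 58000.0], [41, 72000.0], [35, 61000.0]])
column_names = ["age", "income"]

# fit(data=arr, column_names=column_names) labels the columns like this:
df = pd.DataFrame(arr, columns=column_names)
print(list(df.columns))  # ['age', 'income']
```

Passing a DataFrame that already has named columns makes `column_names` unnecessary, which is why it is ignored in that case.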
### method `GReaT.great_sample`

```python
great_sample(
    starting_prompts: Union[str, list[str]],
    temperature: float = 0.7,
    max_length: int = 100,
    device: str = 'cuda'
) → DataFrame
```

Generate synthetic tabular data samples conditioned on a given input.

**Args:**

- `starting_prompts`: String or list of strings on which the output is conditioned, for example "Sex is female, Age is 26"
- `temperature`: The generation samples each token from the probability distribution given by a softmax function, and the temperature parameter controls the sharpness of that distribution. A low temperature makes it sharper (0 equals greedy search); a high temperature brings more diversity, but also more uncertainty, into the output
- `max_length`: Maximal number of tokens to generate; has to be long enough so that no information is cut off
- `device`: Set to "cpu" if the GPU should not be used. You can also specify a concrete GPU, e.g. "cuda:0"

**Returns:** Pandas DataFrame with synthetic data generated based on starting_prompts
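The effect of `temperature` on token sampling can be illustrated with a plain softmax. This NumPy sketch is not GReaT code; the logits are made-up numbers:

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    # dividing the logits by the temperature sharpens (<1)
    # or flattens (>1) the resulting distribution
    z = logits / temperature
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
sharp = softmax(logits, temperature=0.1)    # near-greedy: mass on the top token
diverse = softmax(logits, temperature=2.0)  # flatter: more diversity, more uncertainty
```

At `temperature=0.1` almost all probability mass sits on the most likely token, while at `temperature=2.0` the alternatives keep substantial probability, which is the diversity/uncertainty trade-off described above.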
### method `GReaT.load_finetuned_model`

```python
load_finetuned_model(path: str)
```

Load fine-tuned model.

Load the weights of a fine-tuned large language model into the GReaT pipeline.

**Args:**

- `path`: Path to the fine-tuned model
### classmethod `GReaT.load_from_dir`

```python
load_from_dir(path: str)
```

Load GReaT class.

Load a trained GReaT model from a directory.

**Args:**

- `path`: Directory where the GReaT model is saved

**Returns:** New instance of GReaT loaded from the directory
### method `GReaT.sample`

```python
sample(
    n_samples: int,
    start_col: Optional[str] = '',
    start_col_dist: Optional[Union[dict, list]] = None,
    temperature: float = 0.7,
    k: int = 100,
    max_length: int = 100,
    device: str = 'cuda'
) → DataFrame
```

Generate synthetic tabular data samples.

**Args:**

- `n_samples`: Number of synthetic samples to generate
- `start_col`: Feature to use as the starting point for the generation process. If not given, the target learned during fitting is used as the starting point
- `start_col_dist`: Feature distribution of the starting feature. Should have the format {F1: p1, F2: p2, ...} for discrete columns, or be a list of possible values for continuous columns. If not given, the target distribution learned during fitting is used as the starting point
- `temperature`: The generation samples each token from the probability distribution given by a softmax function, and the temperature parameter controls the sharpness of that distribution. A low temperature makes it sharper (0 equals greedy search); a high temperature brings more diversity, but also more uncertainty, into the output
- `k`: Sampling batch size. Set it as high as memory allows; a larger batch speeds up the generation process significantly
- `max_length`: Maximal number of tokens to generate; has to be long enough so that no information is cut off!
- `device`: Set to "cpu" if the GPU should not be used. You can also specify a concrete GPU

**Returns:** Pandas DataFrame with n_samples rows of generated data
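A `start_col_dist` in the {F1: p1, F2: p2, ...} format for a discrete column can be derived from the training data, e.g. with pandas (sketch; the column values are made up):

```python
import pandas as pd

# discrete starting column: category -> probability,
# matching the "{F1: p1, F2: p2, ...}" format described above
sex = pd.Series(["female", "male", "female", "female"])
start_col_dist = sex.value_counts(normalize=True).to_dict()
print(start_col_dist)  # {'female': 0.75, 'male': 0.25}

# continuous starting column: a plain list of observed values
age_dist = [26, 41, 35]
```

Something like `sample(n_samples=100, start_col="sex", start_col_dist=start_col_dist)` would then condition generation on that distribution instead of the one learned during fitting.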
### method `GReaT.save`

```python
save(path: str)
```

Save GReaT Model.

Saves the model weights and a configuration file in the given directory.

**Args:**

- `path`: Path where the model is saved
---

_This file was automatically generated via lazydocs._