module great_start


class GReaTStart

Abstract superclass for all GReaT start strategies.

GReaT Start creates tokens to start the generation process.

Attributes:

  • tokenizer (AutoTokenizer): Tokenizer, automatically downloaded from the LLM checkpoint

method GReaTStart.__init__

__init__(tokenizer)

Initializes the superclass.

Args:

  • tokenizer: Tokenizer from the HuggingFace library

method GReaTStart.get_start_tokens

get_start_tokens(n_samples: int) → List[List[int]]

Get Start Tokens

Creates the starting points (start prompts) for the generation process.

Args:

  • n_samples: Number of start prompts to create

Returns: List of n_samples lists of token IDs, one per start prompt
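The contract described above (a constructor that stores a tokenizer, and a get_start_tokens method that subclasses override) can be sketched as follows. This is a minimal illustration, not the library's code: StartSketch, FixedPromptStart, and toy_tokenizer are hypothetical names, and the toy tokenizer stands in for a real HuggingFace tokenizer.

```python
from typing import Callable, List


class StartSketch:
    """Hypothetical stand-in for the abstract superclass described above."""

    def __init__(self, tokenizer: Callable[[str], List[int]]):
        self.tokenizer = tokenizer

    def get_start_tokens(self, n_samples: int) -> List[List[int]]:
        # Subclasses are expected to provide the actual start strategy.
        raise NotImplementedError("Subclasses must override get_start_tokens")


class FixedPromptStart(StartSketch):
    """Hypothetical subclass: every sample starts from the same prompt."""

    def __init__(self, tokenizer: Callable[[str], List[int]], prompt: str):
        super().__init__(tokenizer)
        self.prompt = prompt

    def get_start_tokens(self, n_samples: int) -> List[List[int]]:
        # One token list per requested sample.
        return [self.tokenizer(self.prompt) for _ in range(n_samples)]


def toy_tokenizer(text: str) -> List[int]:
    # Toy tokenizer (assumption): one token ID per character.
    return [ord(ch) for ch in text]


start = FixedPromptStart(toy_tokenizer, "age is ")
tokens = start.get_start_tokens(3)
```

The return value is a list with n_samples entries, each a list of integer token IDs, which matches the documented return type List[List[int]].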


class CategoricalStart

Categorical Starting Feature

A categorical column and its categories are used as the starting point.

Attributes:

  • start_col (str): Name of the categorical column
  • population (list[str]): Possible values the column can take
  • weights (list[float]): Probabilities for the individual categories

method CategoricalStart.__init__

__init__(tokenizer, start_col: str, start_col_dist: dict)

Initializes the Categorical Start

Args:

  • tokenizer: Tokenizer from the HuggingFace library
  • start_col: Name of the categorical column
  • start_col_dist: Distribution of the categorical column (dict of form {"Cat A": 0.8, "Cat B": 0.2})

method CategoricalStart.get_start_tokens

get_start_tokens(n_samples)
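The following sketch illustrates how a distribution dict such as {"Cat A": 0.8, "Cat B": 0.2} could be turned into weighted start prompts. It is an assumption-laden illustration, not the library's implementation: the function name sample_categorical_prompts and the "<column> is <value>," prompt template are hypothetical, and tokenization is left out.

```python
import random
from typing import Dict, List


def sample_categorical_prompts(
    start_col: str, start_col_dist: Dict[str, float], n_samples: int
) -> List[str]:
    # Split the dict into categories (population) and their probabilities (weights).
    population = list(start_col_dist.keys())
    weights = list(start_col_dist.values())
    # Draw n_samples categories with replacement, proportionally to the weights.
    values = random.choices(population, weights=weights, k=n_samples)
    # The prompt template below is an assumption made for illustration.
    return [f"{start_col} is {value}," for value in values]


prompts = sample_categorical_prompts("sex", {"Male": 0.8, "Female": 0.2}, 5)
```

Each prompt would then be passed through the tokenizer to obtain the token lists that get_start_tokens returns.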

class ContinuousStart

Continuous Starting Feature

A continuous column with some added noise is used as the starting point.

Attributes:

  • start_col (str): Name of the continuous column
  • start_col_dist (list[float]): Values of the continuous column from the training data set
  • noise (float): Magnitude of the noise added to each value
  • decimal_places (int): Number of decimal places the continuous values have

method ContinuousStart.__init__

__init__(
    tokenizer,
    start_col: str,
    start_col_dist: List[float],
    noise: float = 0.01,
    decimal_places: int = 5
)

Initializes the Continuous Start

Args:

  • tokenizer: Tokenizer from the HuggingFace library
  • start_col: Name of the continuous column
  • start_col_dist: Values of the continuous column from the training data set
  • noise: Magnitude of the noise added to each value
  • decimal_places: Number of decimal places the continuous values have

method ContinuousStart.get_start_tokens

get_start_tokens(n_samples)
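As a rough sketch of the idea behind this class, the snippet below draws values from the training column, perturbs them, and formats them with a fixed number of decimal places. Several details are assumptions rather than documented behavior: the function name sample_continuous_prompts is hypothetical, the noise is modeled as Gaussian scaled by noise, and the "<column> is <value>," template is illustrative only.

```python
import random
from typing import List


def sample_continuous_prompts(
    start_col: str,
    start_col_dist: List[float],
    n_samples: int,
    noise: float = 0.01,
    decimal_places: int = 5,
) -> List[str]:
    # Draw n_samples values with replacement from the training column.
    values = random.choices(start_col_dist, k=n_samples)
    # Perturb each value; Gaussian noise scaled by `noise` is an assumption.
    noisy = [v + random.gauss(0.0, 1.0) * noise for v in values]
    # Format with a fixed number of decimal places; the template is an assumption.
    return [f"{start_col} is {v:.{decimal_places}f}," for v in noisy]


prompts = sample_continuous_prompts("age", [23.0, 41.5, 37.25], 4)
```

Fixing the number of decimal places keeps the textual form of the start values consistent, which matters because the prompts are tokenized as text.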

class RandomStart

Random Starting Features

Randomly chosen column names are used as the starting point. Useful when the distribution of no column is known.

Attributes:

  • all_columns (List[str]): Names of all columns

method RandomStart.__init__

__init__(tokenizer, all_columns: List[str])

Initializes the Random Start

Args:

  • tokenizer: Tokenizer from the HuggingFace library
  • all_columns: Names of all columns

method RandomStart.get_start_tokens

get_start_tokens(n_samples)
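A minimal sketch of starting from random column names, assuming uniform sampling with replacement. The function name sample_random_column_prompts and the "<column> is " template are hypothetical and only illustrate the idea; tokenization is omitted.

```python
import random
from typing import List


def sample_random_column_prompts(all_columns: List[str], n_samples: int) -> List[str]:
    # Pick a column name uniformly at random for each sample.
    columns = random.choices(all_columns, k=n_samples)
    # Leave the value open so the model generates it; the template is an assumption.
    return [f"{col} is " for col in columns]


prompts = sample_random_column_prompts(["age", "sex", "income"], 6)
```

Because no value is supplied, this strategy needs only the column names and no information about any column's distribution.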

This file was automatically generated via lazydocs.