Applies a basic recipe to a data frame that includes a text column. The recipe tokenizes the text into bigrams, removes stop words, keeps at most token_threshold tokens (1,000 by default), and normalizes for document length using TF-IDF.

apply_basic_recipe(
  input_data,
  formula,
  text,
  token_threshold = 1000,
  add_embedding = NULL,
  embed_dims = 100
)

Arguments

input_data

The input data frame. It must contain the text column.

formula

A formula that specifies the relationship between the outcome and predictor variables (e.g., category ~ text).

text

The name of the text column in the data.

token_threshold

The maximum number of tokens to keep for use in the classification. The default value is 1000.

add_embedding

Whether to add word embeddings for feature engineering. The default value is NULL; set it to TRUE to add word embeddings.

embed_dims

Word embedding dimensions. The default value is 100.

Value

A prep object (a prepared recipe).
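
Examples

The steps described above can be sketched with the recipes and textrecipes packages. This is a minimal illustration, not the source of apply_basic_recipe(): the toy data, the column names category and text, the object names, and the exact choice and order of steps (stop words are removed before bigrams are formed) are all assumptions. step_stopwords() also needs the stopwords package installed.

```r
library(recipes)
library(textrecipes)

# Toy data; the column names `category` and `text` are illustrative only.
toy <- data.frame(
  category = factor(c("sports", "politics", "sports", "politics")),
  text = c(
    "the team won the final match last night",
    "the senate passed the new budget bill",
    "the striker scored two late goals",
    "voters backed the tax reform proposal"
  ),
  stringsAsFactors = FALSE
)

# Assumed equivalent of the "basic recipe": tokenize, drop stop words,
# form bigrams, cap the vocabulary size, then weight by TF-IDF.
basic_rec <- recipe(category ~ text, data = toy) %>%
  step_tokenize(text) %>%                                     # word tokens
  step_stopwords(text) %>%                                    # remove stop words
  step_ngram(text, num_tokens = 2L, min_num_tokens = 2L) %>%  # bigrams only
  step_tokenfilter(text, max_tokens = 1000) %>%               # like token_threshold
  step_tfidf(text)                                            # TF-IDF weighting

# Estimate the recipe on the data, then extract the processed features.
prepped <- prep(basic_rec)
baked <- bake(prepped, new_data = NULL)
```

With so few documents, step_tokenfilter() may report that fewer than 1,000 tokens were available; on realistic data the threshold limits the number of TF-IDF columns passed to the classifier.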