HFTrainer
Trains a new Hugging Face Transformer model using the Trainer framework.
Example
The following shows a simple example using this pipeline.
```python
import pandas as pd

from datasets import load_dataset

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

# Pandas DataFrame
df = pd.read_csv("training.csv")
model, tokenizer = trainer("bert-base-uncased", df)

# Hugging Face dataset
ds = load_dataset("glue", "sst2")
model, tokenizer = trainer("bert-base-uncased", ds["train"], columns=("sentence", "label"))

# List of dicts
dt = [{"text": "sentence 1", "label": 0}, {"text": "sentence 2", "label": 1}]
model, tokenizer = trainer("bert-base-uncased", dt)

# Support additional TrainingArguments
model, tokenizer = trainer("bert-base-uncased", dt,
                           learning_rate=3e-5, num_train_epochs=5)
```
All TrainingArguments are supported as function arguments to the trainer call.
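For instance, settings such as the output directory, batch size and weight decay can be passed the same way. The sketch below is illustrative only; the data and argument values are placeholders, not recommended defaults.

```python
from txtai.pipeline import HFTrainer

trainer = HFTrainer()

# Toy training data, any supported format works here
dt = [{"text": "sentence 1", "label": 0}, {"text": "sentence 2", "label": 1}]

# Standard TrainingArguments fields passed through as keyword arguments
model, tokenizer = trainer(
    "bert-base-uncased",
    dt,
    output_dir="model-output",
    per_device_train_batch_size=16,
    weight_decay=0.01,
    num_train_epochs=3
)
```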
See the links below for more detailed examples.
| Notebook | Description |
|---|---|
| Train a text labeler | Build text sequence classification models |
| Train without labels | Use zero-shot classifiers to train new models |
| Train a QA model | Build and fine-tune question-answering models |
| Train a language model from scratch | Build new language models |
Training tasks
The HFTrainer pipeline builds and/or fine-tunes models for the following training tasks (see the example after the table).
| Task | Description |
|---|---|
| language-generation | Causal language model for text generation (e.g. GPT) |
| language-modeling | Masked language model for general tasks (e.g. BERT) |
| question-answering | Extractive question-answering model, typically with the SQuAD dataset |
| sequence-sequence | Sequence-Sequence model (e.g. T5) |
| text-classification | Classify text with a set of labels |
| token-detection | ELECTRA-style pre-training with replaced token detection |
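The task parameter selects one of these tasks. For example, a masked language model can be fine-tuned as sketched below; the training data here is a toy placeholder.

```python
from txtai.pipeline import HFTrainer

trainer = HFTrainer()

# Toy unlabeled data for masked language modeling
data = [{"text": "text for language modeling"}] * 32

# Select the training task with the task parameter
model, tokenizer = trainer("bert-base-uncased", data, task="language-modeling")
```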
PEFT
Parameter-Efficient Fine-Tuning (PEFT) is supported through Hugging Face's PEFT library. Quantization is provided through bitsandbytes. See the examples below.
```python
from txtai.pipeline import HFTrainer

trainer = HFTrainer()
trainer(..., quantize=True, lora=True)
```
When these parameters are set to True, default configurations are used. The configurations can also be customized as shown below.
```python
quantize = {
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16"
}

lora = {
    "r": 16,
    "lora_alpha": 8,
    "target_modules": "all-linear",
    "lora_dropout": 0.05,
    "bias": "none"
}

trainer(..., quantize=quantize, lora=lora)
```
The quantize and lora parameters also accept `transformers.BitsAndBytesConfig` and `peft.LoraConfig` instances.
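For example, equivalent config objects can be constructed and passed directly. This is a minimal sketch using the same settings as the dictionaries above.

```python
import torch

from peft import LoraConfig
from transformers import BitsAndBytesConfig

# Same settings as the dictionary examples, expressed as config instances
quantize = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

lora = LoraConfig(
    r=16,
    lora_alpha=8,
    target_modules="all-linear",
    lora_dropout=0.05,
    bias="none"
)

trainer(..., quantize=quantize, lora=lora)
```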
See the PEFT documentation for more information.
Methods
Python documentation for the pipeline.
```python
__call__(base, train, validation=None, columns=None, maxlength=None, stride=128, task='text-classification', prefix=None, metrics=None, tokenizers=None, checkpoint=None, quantize=None, lora=None, **args)
```
Builds a new model using arguments.
Parameters:

| Name | Description | Default |
|---|---|---|
| base | path to base model, accepts Hugging Face model hub id, local path or (model, tokenizer) tuple | required |
| train | training data | required |
| validation | validation data | None |
| columns | tuple of columns to use for text/label, defaults to (text, None, label) | None |
| maxlength | maximum sequence length, defaults to tokenizer.model_max_length | None |
| stride | chunk size for splitting data for QA tasks | 128 |
| task | optional model task or category, determines the model type, defaults to "text-classification" | 'text-classification' |
| prefix | optional source prefix | None |
| metrics | optional function that computes and returns a dict of evaluation metrics | None |
| tokenizers | optional number of concurrent tokenizers, defaults to None | None |
| checkpoint | optional resume from checkpoint flag or path to checkpoint directory, defaults to None | None |
| quantize | quantization configuration to pass to base model | None |
| lora | lora configuration to pass to PEFT model | None |
| args | training arguments | {} |
Returns:
| Type | Description |
|---|---|
| (model, tokenizer) | |
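As an illustration, the metrics parameter can be a function that follows the Hugging Face Trainer compute_metrics convention. The sketch below uses toy placeholder data and computes accuracy on the validation set.

```python
import numpy as np

from txtai.pipeline import HFTrainer

def accuracy(eval_pred):
    # Receives (predictions, labels) and returns a dict of metric values
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": float((predictions == labels).mean())}

# Toy placeholder data
train = [{"text": "sentence 1", "label": 0}, {"text": "sentence 2", "label": 1}]
validation = [{"text": "sentence 3", "label": 1}, {"text": "sentence 4", "label": 0}]

trainer = HFTrainer()
model, tokenizer = trainer("bert-base-uncased", train, validation=validation, metrics=accuracy)
```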
Source code in txtai/pipeline/train/hftrainer.py