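LLM Classifier: a tool for classifying data with LLMs (Lamini and Llama 2)

Project overview

Train a new classifier with just a prompt. No data is needed, but if you have data you can add it to boost the classifier.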

from lamini import LaminiClassifier

llm = LaminiClassifier()

# One descriptive prompt per class
prompts = {
    "cat": "Cats are generally more independent and aloof than dogs, who are often more social and affectionate. Cats are also more territorial and may be more aggressive when defending their territory.  Cats are self-grooming animals, using their tongues to keep their coats clean and healthy. Cats use body language and vocalizations, such as meowing and purring, to communicate.",
    "dog": "Dogs are more pack-oriented and tend to be more loyal to their human family.  Dogs, on the other hand, often require regular grooming from their owners, including brushing and bathing. Dogs use body language and barking to convey their messages. Dogs are also more responsive to human commands and can be trained to perform a wide range of tasks.",
}

llm.prompt_train(prompts)

llm.save("models/my_model.lamini")

llm.predict(["meow"])
# >> ["cat"]

llm.predict(["meow", "woof"])
# >> ["cat", "dog"]

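Note: each class prompt makes 10 LLM inference calls. See the Advanced section below to change this.

(Optional) Add any data.

This can help improve your classifier. For example, if the LLM makes a mistake: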

llm.predict(["i like milk", "i like bones"])
# >> ["dog", "cat"]  (wrong!)

You can correct the LLM by adding those examples as data.

Your LLM classifier will learn from them:

llm = LaminiClassifier()

llm.add_data_to_class("cat", "i like milk.")
llm.add_data_to_class("dog", ["i like bones"])  # a list of examples is valid too

llm.prompt_train(prompts)

llm.predict(["i like milk", "i like bones"])
# >> ["cat", "dog"]  (correct!)

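If you include data for classes that are not in your classes prompts, the classifier will include them as new classes and learn to predict them. However, without a prompt, it will have no description to use for augmenting them further.

General guideline: if you have no data or only a little data on a class, make sure to give it a good prompt. As with prompt engineering for any LLM, a good description (for example, one that includes details and examples) helps the LLM get the right result.

You can also work with this data more easily through files.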

# Load data
llm.load_examples(saved_examples_path="path/to/examples.jsonl")

# Print data
print(llm.get_data())

# Save data
llm.saved_examples_path = "path/to/examples.jsonl"  # overrides default at /tmp/saved_examples.jsonl
llm.save_examples()

The format of examples.jsonl is as follows, with one JSON object per line:

{"class_name": "cat", "examples": ["i like milk", "meow"]}
{"class_name": "dog", "examples": ["woof", "i like bones"]}
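Because each line of examples.jsonl is an independent JSON object, the file can be prepared or inspected with the Python standard library alone. The following is a minimal sketch and is not part of the lamini package: the helper names write_examples and read_examples are hypothetical, and only the field names class_name and examples come from the format shown above.

```python
import json
import os
import tempfile


def write_examples(path, examples_by_class):
    """Write one JSON object per line, one line per class."""
    with open(path, "w", encoding="utf-8") as f:
        for class_name, examples in examples_by_class.items():
            record = {"class_name": class_name, "examples": list(examples)}
            f.write(json.dumps(record) + "\n")


def read_examples(path):
    """Read the JSONL file back into a {class_name: [examples]} dict."""
    examples_by_class = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            examples_by_class.setdefault(record["class_name"], []).extend(record["examples"])
    return examples_by_class


# Round-trip the two records shown above through a temporary file.
data = {"cat": ["i like milk", "meow"], "dog": ["woof", "i like bones"]}
path = os.path.join(tempfile.mkdtemp(), "examples.jsonl")
write_examples(path, data)
loaded = read_examples(path)
print(loaded)
```

A file written this way should be loadable with llm.load_examples(saved_examples_path=...), assuming the classifier expects exactly the two fields shown.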

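Advanced

Change the number of LLM examples generated per prompt (and therefore the number of inference calls, which goes up to this count):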

llm = LaminiClassifier(augmented_example_count=5) # 10 is default

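Note that we have found 10 to be a good proxy for training an effective classifier.

If you are working with more classes and want better performance, a higher augmented_example_count can help.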
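To budget for a training run, the upper bound on inference calls can be estimated from the two statements above: one call per generated example, per class prompt. This is a back-of-the-envelope sketch only; max_inference_calls is a hypothetical helper, not a lamini function, and the real number of calls may be lower.

```python
def max_inference_calls(num_classes, augmented_example_count=10):
    """Upper bound on generation calls: one per augmented example per class."""
    if num_classes < 0 or augmented_example_count < 0:
        raise ValueError("counts must be non-negative")
    return num_classes * augmented_example_count


# Two classes (cat and dog) with the default and with a reduced count.
default_calls = max_inference_calls(2)
reduced_calls = max_inference_calls(2, augmented_example_count=5)
print(default_calls, reduced_calls)
```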

Run now

./train.sh

We have some default classes. You can easily specify your own like this:

./train.sh --class "cat: CAT_PROMPT" --class "dog: DOG_PROMPT"

The prompt is a description of your class.
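Since a prompt is free text and may itself contain colons, a 'class_name: prompt' argument is safest to split on the first colon only. The sketch below illustrates that idea; parse_class_spec is a hypothetical helper and is not taken from the repository's train.py, which may parse its arguments differently.

```python
def parse_class_spec(spec):
    """Split 'class_name: prompt' on the first colon only."""
    name, sep, prompt = spec.partition(":")
    if not sep or not name.strip() or not prompt.strip():
        raise ValueError(f"expected 'class_name:prompt', got {spec!r}")
    return name.strip(), prompt.strip()


# The second colon in the cat prompt stays part of the prompt text.
specs = ["cat: Cats say meow: they purr too.", "dog:Dogs bark."]
classes = dict(parse_class_spec(s) for s in specs)
print(classes)
```

The resulting dict has the same shape as the prompts dict passed to prompt_train in the Python examples.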

./classify.sh 'woof'

You get the probabilities for all classes, in this case dog (62%) and cat (38%). These can help with gauging uncertainty.

{ 'data': 'woof', 'prediction': 'dog', 'probabilities': array([0.37996491, 0.62003509])}
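One way to turn those probabilities into an uncertainty signal is to look at the margin between the top two classes, or at the entropy of the distribution. The sketch below is illustrative only: summarize is a hypothetical helper, and the class order ["cat", "dog"] is an assumption inferred from the output above, where the second value (0.62) belongs to dog.

```python
import math


def summarize(probabilities, class_names):
    """Pair each probability with its class and report simple uncertainty measures."""
    if len(probabilities) != len(class_names):
        raise ValueError("need exactly one probability per class")
    ranked = sorted(zip(class_names, probabilities), key=lambda pair: pair[1], reverse=True)
    top_class, top_p = ranked[0]
    # Margin: gap between the best and second-best class (small gap = uncertain).
    margin = top_p - ranked[1][1] if len(ranked) > 1 else top_p
    # Entropy in bits: 0 for a certain prediction, 1 for a 50/50 two-class split.
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return {"prediction": top_class, "margin": margin, "entropy_bits": entropy}


# Probabilities copied from the 'woof' output above; class order is assumed.
result = summarize([0.37996491, 0.62003509], ["cat", "dog"])
print(result)
```

A small margin or an entropy close to 1 bit suggests the example is a good candidate to add as labeled data with add_data_to_class.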

Here are our cat/dog prompts.

Cat prompt:

Cats are generally more independent and aloof. Cats are also more territorial and may be more aggressive when defending their territory. Cats are self-grooming animals, using their tongues to keep their coats clean and healthy. Cats use body language and vocalizations, such as meowing and purring, to communicate.  An example cat is whiskers, who is a cat who lives in a house with a human. Another example cat is furball, who likes to eat food and sleep.  A famous cat is garfield, who is a cat who likes to eat lasagna.

Dog prompt:

Dogs are social animals that live in groups, called packs, in the wild. They are also highly intelligent and trainable. Dogs are also known for their loyalty and affection towards their owners. Dogs are also known for their ability to learn and perform a variety of tasks, such as herding, hunting, and guarding.  An example dog is snoopy, who is the best friend of charlie brown.  Another example dog is clifford, who is a big red dog.

./classify.sh --data "I like to sharpen my claws on the furniture." --data "I like to roll in the mud." --data "I like to run any play with a ball." --data "I like to sleep under the bed and purr." --data "My owner is charlie brown." --data "Meow, human! I'm famished! Where's my food?" --data "Purr-fect." --data "Hiss! Who dared to wake me from my nap? I'll have my revenge... later." --data "I'm so happy to see you! Can we go for a walk/play fetch/get treats now?" --data "I'm feeling a little ruff today, can you give me a belly rub to make me feel better?"

{'data': 'I like to sharpen my claws on the furniture.',
 'prediction': 'cat',
 'probabilities': array([0.55363432, 0.44636568])}
{'data': 'I like to roll in the mud.',
 'prediction': 'dog',
 'probabilities': array([0.4563782, 0.5436218])}
{'data': 'I like to run any play with a ball.',
 'prediction': 'dog',
 'probabilities': array([0.44391914, 0.55608086])}
{'data': 'I like to sleep under the bed and purr.',
 'prediction': 'cat',
 'probabilities': array([0.51146226, 0.48853774])}
{'data': 'My owner is charlie brown.',
 'prediction': 'dog',
 'probabilities': array([0.40052991, 0.59947009])}
{'data': "Meow, human! I'm famished! Where's my food?",
 'prediction': 'cat',
 'probabilities': array([0.5172964, 0.4827036])}
{'data': 'Purr-fect.',
 'prediction': 'cat',
 'probabilities': array([0.50431873, 0.49568127])}
{'data': "Hiss! Who dared to wake me from my nap? I'll have my revenge... "
         'later.',
 'prediction': 'cat',
 'probabilities': array([0.50088163, 0.49911837])}
{'data': "I'm so happy to see you! Can we go for a walk/play fetch/get treats "
         'now?',
 'prediction': 'dog',
 'probabilities': array([0.42178513, 0.57821487])}
{'data': "I'm feeling a little ruff today, can you give me a belly rub to make "
         'me feel better?',
 'prediction': 'dog',
 'probabilities': array([0.46141002, 0.53858998])}

Installation

Clone this repository, then run the train.sh or classify.sh command-line tools.

Requires Docker: https://docs.docker.com/get-docker

Set up your lamini key (free): https://lamini-ai.github.io/

git clone git@github.com:lamini-ai/llm-classifier.git

cd llm-classifier

Train a new classifier.

./train.sh --help

usage: train.py [-h] [--class CLASS [CLASS ...]] [--train TRAIN [TRAIN ...]] [--save SAVE] [-v]

options:
  -h, --help            show this help message and exit
  --class CLASS [CLASS ...]
                        The classes to use for classification, in the format 'class_name:prompt'.
  --train TRAIN [TRAIN ...]
                        The training data to use for classification, in the format 'class_name:data'.
  --save SAVE           The path to save the model to.
  -v, --verbose         Whether to print verbose output.

Classify your data.

./classify.sh --help

usage: classify.py [-h] [--data DATA [DATA ...]] [--load LOAD] [-v] [classify ...]

positional arguments:
  classify              The data to classify.

options:
  -h, --help            show this help message and exit
  --data DATA [DATA ...]
                        The training data to use for classification, any string.
  --load LOAD           The path to load the model from.
  -v, --verbose         Whether to print verbose output.

Python library

Install it: pip install lamini

Instantiate a classifier

from lamini import LaminiClassifier

# Create a new classifier
classifier = LaminiClassifier()

Define classes using prompts

classes = { "SOME_CLASS" : "SOME_PROMPT" }
classifier.prompt_train(classes)

Or, if you have some training examples (optional)

data = ["example 1", "example 2"]
classifier.add_data_to_class("SOME_CLASS", data)

# Don't forget to train after adding data
classifier.prompt_train()

Classify your data

# Classify the data - in a list of string(s)
prediction = classifier.predict(list_of_strings)

# Get the probabilities for each class
probabilities = classifier.predict_proba(list_of_strings)

Save your model

classifier.save("SOME_PATH")

Load your model

classifier = LaminiClassifier.load("SOME_PATH")

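How does it work?

The LLM Classifier uses the Llama 2 LLM to convert your prompts into a pile of data for each class. It then fine-tunes another LLM to distinguish between the piles of data.

We use several specialized LLMs derived from Llama 2 to convert the prompts into piles of training examples for each class. The code for this is available in the lamini Python package if you would like to look at it.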
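The two-stage shape of that pipeline (prompt to pile of examples, then a model trained to tell the piles apart) can be illustrated with a deliberately tiny toy. This is not Lamini's implementation: generate_examples is a hypothetical stand-in that merely splits the prompt into sentences instead of calling an LLM, and the fine-tuned LLM is swapped for a plain bag-of-words word counter. Only the data flow is meant to carry over.

```python
from collections import Counter


def generate_examples(prompt, count):
    """Stand-in for the LLM generator: reuse the prompt's sentences as examples."""
    sentences = [s.strip().lower() for s in prompt.split(".") if s.strip()]
    return sentences[:count]


def train(piles):
    """Stand-in for fine-tuning: count word frequencies in each class's pile."""
    return {
        name: Counter(word for text in texts for word in text.split())
        for name, texts in piles.items()
    }


def predict(model, text):
    """Pick the class whose pile shares the most words with the input."""
    words = text.lower().split()
    scores = {name: sum(counts[w] for w in words) for name, counts in model.items()}
    return max(scores, key=scores.get)


prompts = {
    "cat": "Cats meow and purr. Cats groom themselves.",
    "dog": "Dogs bark and woof. Dogs fetch balls.",
}

# Stage 1: turn each prompt into a pile of training examples.
piles = {name: generate_examples(prompt, count=10) for name, prompt in prompts.items()}
# Stage 2: train a classifier to distinguish the piles.
model = train(piles)

predictions = [predict(model, text) for text in ["meow", "woof"]]
print(predictions)
```

Unlike this toy, the real generator produces new examples rather than echoing the prompt, which is why richer prompts with details and examples tend to yield better piles.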

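Is this perfect?

No, this is a weeknight hackathon project. Give us feedback and we will improve it. Some known issues:

  1. It doesn't make heavy use of batching across classes, so training on many classes could be sped up by more than 100x.
  2. We are refining the LLM example generators. Send us any issues you find with your prompts and we can improve these models.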

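Why wouldn't I just use a normal classifier like BART, XGBoost, BERT, etc.?

You don't need to label any data to use LaminiClassifier. Labeling data sucks.

No fiddling with hyperparameters. Fiddle with prompts instead. Hopefully English is easier than attention_dropout_pcts.

Why wouldn't I just use an LLM directly?

A classifier always outputs a valid class. An LLM might answer the question "Is this talking about a cat?" with "Well... it depends on...". Writing a parser sucks.

Added benefit: classifiers give you probabilities and can be calibrated: https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/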
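As a rough illustration of what threshold moving means for a two-class classifier: instead of taking the class with the highest probability, you predict one class only when its probability clears a threshold you choose. The sketch below is an assumption-laden example, not a lamini feature: predict_with_threshold is a hypothetical helper, the class order ["cat", "dog"] is assumed, and the 0.55 threshold is arbitrary.

```python
def predict_with_threshold(probabilities, class_names, positive_class, threshold=0.5):
    """Predict positive_class only when its probability reaches the threshold (two classes)."""
    if len(class_names) != 2 or len(probabilities) != 2:
        raise ValueError("this sketch only handles the two-class case")
    idx = class_names.index(positive_class)
    negative_class = class_names[1 - idx]
    return positive_class if probabilities[idx] >= threshold else negative_class


classes = ["cat", "dog"]  # assumed order of the probability array
borderline = [0.50088163, 0.49911837]  # a near 50/50 prediction

default_label = predict_with_threshold(borderline, classes, "cat")
strict_label = predict_with_threshold(borderline, classes, "cat", threshold=0.55)
print(default_label, strict_label)
```

In practice the threshold would be chosen on held-out labeled data rather than picked by hand.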

Project link

https://github.com/lamini-ai/llm-classifier
