SFT Fine-Tuning Llama 3 8B | Google Colab Cloud Compute + Unsloth Open-Source Framework + Multiple Hugging Face Datasets

Author: Joe
Published: May 15, 2024
Categories: AI

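View this article on Notion: https://www.notion.so/SFT-Llama-3-8B-Google-Colab-Unsloth-Hugging-Face-9a2cda9e41df49be89815f220e602f25?pvs=21
This article is also published at https://www.j000e.com/AI/llama3_sft.html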

Datasets

Rookie/Llama-3-8B-Instruct-Chinese · Hugging Face: https://huggingface.co/Rookie/Llama-3-8B-Instruct-Chinese

Single-dataset version

%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# 3. Test before fine-tuning
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}
### Input:
{}
### Response:
{}"""

FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
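    alpaca_prompt.format(
        "请用中文回答", # instruction: "Please answer in Chinese"
        "海绵宝宝的书法是不是叫做海绵体？", # input: a Ruozhiba-style pun, "Is SpongeBob's calligraphy called sponge script?"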
        "", # output
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

# Ruozhiba dataset cell (Alpaca-style instruction/input/output columns).
# As written, this loads yahma/alpaca-cleaned; the three-dataset version below
# loads Ruozhiba from kigner/ruozhiba-llama3-tt with the same column names.
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    # input_text avoids shadowing the built-in input()
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
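
To see what the formatting cell above produces, here is a minimal stand-alone sketch that applies the same logic to a hand-made batch. The `<eos>` string is only a placeholder for `tokenizer.eos_token`, and the toy rows and the name `format_batch` are invented for illustration; nothing is downloaded.

```python
# Stand-alone sketch: no tokenizer or dataset download needed.
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}
### Input:
{}
### Response:
{}"""

EOS_TOKEN = "<eos>"  # assumption: placeholder for tokenizer.eos_token


def format_batch(examples):
    """Turn a column-oriented batch into one training string per row."""
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        texts.append(ALPACA_PROMPT.format(instruction, input_text, output) + EOS_TOKEN)
    return {"text": texts}


# Toy batch in the layout that map(..., batched=True) passes in:
# one list per column.
toy_batch = {
    "instruction": ["Translate to French", "Add the numbers"],
    "input": ["Good morning", "2 and 3"],
    "output": ["Bonjour", "5"],
}

formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Each row becomes a single string that ends with the EOS marker, which is what tells the model where a response stops.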

# MOSS dataset cell (running it replaces the `dataset` built above)
from datasets import load_dataset

# Constants
EOS_TOKEN = tokenizer.eos_token
# Plain template for MOSS turns. It gets its own name so the Alpaca template
# defined earlier stays intact for the positional .format() calls in the test cells.
moss_prompt = "{instruction}\n{input}\n{output}"  # Make sure this template suits your task

# Formatting function for the `YeungNLP/moss-003-sft-data` dataset
def formatting_moss_func(examples):
    conversations = examples["conversation"]
    texts = []
    for conversation in conversations:
        combined_text = ""
        for message in conversation:
            if "human" in message and "assistant" in message:
                instruction = message["human"]
                output = message["assistant"]
                combined_text += moss_prompt.format(instruction=instruction, input="", output=output) + EOS_TOKEN
        texts.append(combined_text)
    return {"text": texts}

# Load and format the dataset, then check how many rows it has
dataset = load_dataset("YeungNLP/moss-003-sft-data", split="train")
dataset = dataset.map(formatting_moss_func, batched=True)
print(f"Dataset Length: {len(dataset)}")

# Inspect one formatted example
print(f"Dataset Example: {dataset[0]}")
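
The MOSS rows are multi-turn conversations, so the formatting function above joins every turn into one string. A minimal sketch of that flattening on a toy conversation follows; `<eos>` is again a placeholder for `tokenizer.eos_token`, and `flatten_conversation` and the sample turns are hypothetical names and data.

```python
EOS_TOKEN = "<eos>"  # assumption: placeholder for tokenizer.eos_token
TURN_TEMPLATE = "{instruction}\n{input}\n{output}"


def flatten_conversation(conversation):
    """Join every human/assistant turn into a single training string."""
    parts = []
    for turn in conversation:
        # Skip malformed turns that lack either side of the exchange.
        if "human" in turn and "assistant" in turn:
            parts.append(
                TURN_TEMPLATE.format(
                    instruction=turn["human"], input="", output=turn["assistant"]
                )
                + EOS_TOKEN
            )
    return "".join(parts)


toy_conversation = [
    {"human": "Hi", "assistant": "Hello!"},
    {"human": "What is 1 + 1?", "assistant": "2"},
]

text = flatten_conversation(toy_conversation)
print(repr(text))
```

Two things stand out in the result: the empty `input` leaves a blank line between question and answer, and the turns carry no role markers, so the model only sees alternating plain text separated by EOS tokens.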

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
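
For a sense of scale, the number of weights this LoRA configuration trains can be worked out by hand. The sketch below assumes the publicly documented Llama 3 8B shapes (hidden size 4096, MLP width 14336, 32 layers, 8 key/value heads of dimension 128); those numbers come from the model config, not from this article, and `lora_trainable_params` is a hypothetical helper.

```python
# Assumed Llama 3 8B shapes (from the published model config, not from this article).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP width)
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32

# (in_features, out_features) for each projection listed in target_modules.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank):
    """LoRA adds A (rank x in) and B (out x rank) to every adapted matrix."""
    per_layer = sum(
        rank * (fan_in + fan_out) for fan_in, fan_out in PROJECTIONS.values()
    )
    return per_layer * NUM_LAYERS


for rank in (16, 32):
    print(f"r = {rank}: {lora_trainable_params(rank):,} trainable parameters")
```

Under these assumptions r = 16 gives 41,943,040 trainable weights, roughly half a percent of an 8B model, and doubling the rank to 32 doubles that count.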

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
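
The batch settings above decide how much data the run actually touches. A small worked calculation, assuming a single GPU as on a standard Colab runtime (`samples_seen` is a hypothetical helper, and the second call uses the three-dataset settings that appear later in the article):

```python
def samples_seen(per_device_batch, grad_accum_steps, max_steps, num_devices=1):
    """Return (effective batch size, total sequences consumed by training)."""
    effective_batch = per_device_batch * grad_accum_steps * num_devices
    return effective_batch, effective_batch * max_steps


# Single-dataset version: batch 2, accumulation 4, 60 steps.
single = samples_seen(per_device_batch=2, grad_accum_steps=4, max_steps=60)

# Three-dataset version: batch 8, accumulation 4, 200 steps.
triple = samples_seen(per_device_batch=8, grad_accum_steps=4, max_steps=200)

print(f"single-dataset: effective batch {single[0]}, sequences seen {single[1]}")
print(f"three-dataset:  effective batch {triple[0]}, sequences seen {triple[1]}")
```

So the 60-step run sees 480 sequences and the 200-step run sees 6,400, in both cases far less than one pass over the datasets being loaded.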

trainer_stats = trainer.train()

# 7. Test the fine-tuned model
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
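    alpaca_prompt.format(
        "只用中文回答问题", # instruction: "Answer the question in Chinese only"
        "火烧赤壁 曹操为何不拨打119求救？", # input: "At the burning of Red Cliffs, why didn't Cao Cao dial 119 for help?" (119 is China's fire emergency number)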
        "", # output
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

# 8. Save the LoRA adapter
model.save_pretrained("lora_model") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Save online to Hugging Face (requires a token)

# 9. Merge the model, quantize it to 4-bit GGUF, and save
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
#model.save_pretrained_merged("outputs", tokenizer, save_method = "merged_16bit",) # Merge the model and save it as 16-bit Hugging Face weights
#model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "") # Merge to 4-bit GGUF and upload to Hugging Face (requires an account token)

# 10. Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# 11. Copy the model to Google Drive
import os
import shutil
source_file = '/content/model-unsloth.Q4_K_M.gguf'
destination_dir = '/content/drive/MyDrive/Llama3'
os.makedirs(destination_dir, exist_ok=True)  # shutil.copy fails if the folder does not exist yet
destination_file = f'{destination_dir}/model-unsloth.Q4_K_M.gguf'
shutil.copy(source_file, destination_file)

Three-dataset version

%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# 3. Test before fine-tuning
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}
### Input:
{}
### Response:
{}"""

FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
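    alpaca_prompt.format(
        "请用中文回答", # instruction: "Please answer in Chinese"
        "海绵宝宝的书法是不是叫做海绵体？", # input: a Ruozhiba-style pun, "Is SpongeBob's calligraphy called sponge script?"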
        "", # output
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

from datasets import load_dataset

# Constants
EOS_TOKEN = tokenizer.eos_token
alpaca_prompt = "{instruction}\n{input}\n{output}"  # Replaces the Alpaca template above; make sure this format suits your task

# Generic formatting function
def formatting_prompts_func(examples):
    instructions = examples.get("instruction", [])
    inputs = examples.get("input", [])
    outputs = examples.get("output", [])
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction=instruction, input=input_text, output=output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# Formatting function for the `YeungNLP/moss-003-sft-data` dataset
def formatting_moss_func(examples):
    conversations = examples["conversation"]
    texts = []
    for conversation in conversations:
        combined_text = ""
        for message in conversation:
            if "human" in message and "assistant" in message:
                instruction = message["human"]
                output = message["assistant"]
                combined_text += alpaca_prompt.format(instruction=instruction, input="", output=output) + EOS_TOKEN
        texts.append(combined_text)
    return {"text": texts}

# Formatting function for the `YeungNLP/firefly-train-1.1M` dataset
def formatting_firefly_func(examples):
    kinds = examples["kind"]
    inputs = examples["input"]
    outputs = examples["target"]
    texts = []
    for kind, input_text, output in zip(kinds, inputs, outputs):
        instruction = f"{kind}: {input_text}"
        text = alpaca_prompt.format(instruction=instruction, input="", output=output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# Load and format each dataset, printing its row count
dataset1 = load_dataset("kigner/ruozhiba-llama3-tt", split="train")
dataset1 = dataset1.map(formatting_prompts_func, batched=True)
print(f"Dataset 1 Length: {len(dataset1)}")

dataset2 = load_dataset("YeungNLP/moss-003-sft-data", split="train")
dataset2 = dataset2.map(formatting_moss_func, batched=True)
print(f"Dataset 2 Length: {len(dataset2)}")

dataset3 = load_dataset("YeungNLP/firefly-train-1.1M", split="train")
dataset3 = dataset3.map(formatting_firefly_func, batched=True)
print(f"Dataset 3 Length: {len(dataset3)}")

# Inspect the first formatted example of each dataset
print(f"Dataset 1 Example: {dataset1[0]}")
print(f"Dataset 2 Example: {dataset2[0]}")
print(f"Dataset 3 Example: {dataset3[0]}")

# Merge the datasets
from datasets import concatenate_datasets

combined_dataset = concatenate_datasets([dataset1, dataset2, dataset3])
print(f"Combined Dataset Length: {len(combined_dataset)}")

# Print a sample to verify that the formatting is correct
print(combined_dataset[0])

# 5. Set the training parameters
from trl import SFTTrainer
from transformers import TrainingArguments

model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Suggested: 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0.1, # Supports any value, but 0 is the optimized setting
    bias = "none",
    use_gradient_checkpointing = "unsloth", # Gradient checkpointing; "unsloth" mode suits very long context
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,  # Use the merged dataset
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 4,
        warmup_steps = 50,
        max_steps = 200,  # Number of fine-tuning steps
        learning_rate = 2e-4, # Learning rate
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
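
With `warmup_steps = 50`, `max_steps = 200` and `lr_scheduler_type = "linear"`, the learning rate ramps up for the first quarter of the run and then decays to zero. The sketch below reproduces that shape in plain Python; it is an approximation written for illustration (`linear_schedule_lr` is a hypothetical helper), and the values the trainer logs may be offset by one step depending on how steps are counted.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=50, total_steps=200):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 25, 50, 125, 200):
    print(f"step {step:3d}: lr = {linear_schedule_lr(step):.2e}")
```

The peak of 2e-4 is reached only at step 50, and by step 125 the rate is already back down to half of that.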

trainer_stats = trainer.train()

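Single-dataset results

哥哥4岁，弟弟是哥哥一半，哥哥一百岁，弟弟多少岁？ ("The older brother is 4 and the younger brother is half his age. When the older brother is 100, how old is the younger brother?")

Original model: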

A classic lateral thinking puzzle! 😄
Let's break it down:
哥哥 (older brother) is 4 years old.
弟弟 (younger brother) 是哥哥一半 (is half of older brother), so younger brother is also 2 years old (since older brother is 4, and half of that is 2).
如果哥哥一百岁 (if older brother were 100 years old), then younger brother would be 50 years old (half of 100).
So, if older brother is actually 4 years old now, but we imagine him being 100 years old... then younger brother would also be half that age, which is 50.
But wait! This is a thought experiment. In reality, since older brother is only 4 years old, younger brother can't possibly be 50 years old either. The correct answer is simply "2" (since he's half of the older brother's age). 😊
😄
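根据问题的描述，我们可以知道：
哥哥现在4岁。
弟弟是哥哥的一半，这意味着当哥哥4岁时，弟弟也只有2岁（因为1/2）。
如果哥哥一百岁，那么他已经活了96年（100 - 4 = 96），这意味着他出生的时候弟弟还没有出生。
因此，我们可以计算弟弟的年龄：从哥哥出生的那天起，过了96年的时间，这期间弟弟也在长大。根据这个过程，可以算出弟弟现在是102岁（2 + 96 = 98）。
[English: From the problem statement we know: the older brother is now 4. The younger brother is half his age, so when the older brother is 4 the younger brother is only 2 (because 1/2). If the older brother is 100, he has lived another 96 years (100 - 4 = 96), which means the younger brother had not been born when he was born. So we can compute the younger brother's age: 96 years have passed since the older brother was born, during which the younger brother also grew up. From this, the younger brother is now 102 (2 + 96 = 98).]
😄
如果哥哥4岁，那么弟弟是一半，即2岁。
如果哥哥一百岁，那么弟弟还是一半，即50岁。
[English: If the older brother is 4, the younger brother is half that, i.e. 2. If the older brother is 100, the younger brother is still half, i.e. 50.]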

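SFT model:

当哥哥四岁时，弟弟两岁。因此，当哥哥一百岁时，弟弟将比他小96年。在这期间，弟弟将增加94岁。所以，当哥哥一百岁时，弟弟98岁。
[English: When the older brother is four, the younger brother is two. So when the older brother is one hundred, the younger brother will be 96 years younger than him. Over that period the younger brother gains 94 years. So when the older brother is one hundred, the younger brother is 98.]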
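
For reference, the arithmetic the riddle is after can be checked in a few lines, under the usual reading that the age gap between the brothers stays constant. `younger_age` is a hypothetical helper written for this check.

```python
def younger_age(older_now, younger_now, older_later):
    """The age gap is fixed, so subtract it from the older brother's later age."""
    gap = older_now - younger_now
    return older_later - gap


# Older brother is 4, younger is half of that (2); ask again at age 100.
answer = younger_age(older_now=4, younger_now=4 // 2, older_later=100)
print(answer)  # 98
```

The gap is 2 years, so the expected answer is 98. The SFT model lands on that number even though its intermediate statements (a 96-year gap, a 94-year increase) do not add up, while the original model's attempts end at 2, 102 and 50.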


Last modified: July 20, 2024, 14:25
Article link: https://www.j000e.com/AI/llama3_sft.html
Copyright: this article is an original work by Joe. Copyright belongs to the author.
Reprint notice: the reprint license type is shown by the mark at the bottom right of the original page. Where standard reprinting is allowed, the reprint must credit the original source and its URL.


