
LLM Fine-Tuning (Part 1) | Fine-Tuning Llama 2.0 with QLoRA on a Single GPU

AI技术研习社 2024-07-11

LLaMA 2 brings a number of improvements over LLaMA 1, such as extending the context window from 2048 to 4096 tokens and using Grouped-Query Attention (GQA), which lets groups of query heads share a single set of key and value projections (a shape-only sketch follows the article list below).

For more details on LLaMA 2, see the following articles:

Meta Releases the Upgraded LLaMA 2: Open Source and Commercially Usable

Unveiling the Leading Chinese Llama 2 Model!
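As a shape-only sketch of what GQA's key/value sharing means (the head counts below are illustrative, not the exact configuration of any particular Llama 2 size):

    import torch

    # Grouped-Query Attention, shapes only: 32 query heads share 8 key/value heads,
    # so each KV head serves a group of 4 query heads (illustrative numbers).
    batch, seq, head_dim = 2, 16, 128
    n_q_heads, n_kv_heads = 32, 8
    group = n_q_heads // n_kv_heads

    q = torch.randn(batch, n_q_heads, seq, head_dim)
    k = torch.randn(batch, n_kv_heads, seq, head_dim)
    v = torch.randn(batch, n_kv_heads, seq, head_dim)

    # repeat each KV head across its group, then run standard attention
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
    out = attn @ v
    print(out.shape)   # torch.Size([2, 32, 16, 128])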

Fine-tuning LLaMA 2 with QLoRA

Environment setup

    pip install transformers datasets peft accelerate bitsandbytes safetensors

Import libraries

    import os, sys
    import torch
    import datasets
    from transformers import (
        AutoTokenizer,
        AutoModelForCausalLM,
        BitsAndBytesConfig,
        DataCollatorForLanguageModeling,
        DataCollatorForSeq2Seq,
        Trainer,
        TrainingArguments,
        GenerationConfig,
    )
    from peft import PeftModel, LoraConfig, prepare_model_for_kbit_training, get_peft_model

Load the LLaMA 2 model

    ### config ###
    model_id = "NousResearch/Llama-2-7b-hf"  # alternatively: meta-llama/Llama-2-7b-chat-hf
    max_length = 512
    device_map = "auto"
    batch_size = 128
    micro_batch_size = 32
    gradient_accumulation_steps = batch_size // micro_batch_size

    # "nf4" uses a symmetric 4-bit quantization scheme
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    # load model from huggingface
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        use_cache=False,
        device_map=device_map,
    )

    # load tokenizer from huggingface
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

Print the model's trainable parameter count

    def print_number_of_trainable_model_parameters(model):
        trainable_model_params = 0
        all_model_params = 0
        for _, param in model.named_parameters():
            all_model_params += param.numel()
            if param.requires_grad:
                trainable_model_params += param.numel()
        print(f"trainable model parameters: {trainable_model_params}. All model parameters: {all_model_params}")
        return trainable_model_params

    ori_p = print_number_of_trainable_model_parameters(model)

    # output
    # trainable model parameters: 262,410,240

Configure LoRA parameters

    # LoRA config
    model = prepare_model_for_kbit_training(model)
    peft_config = LoraConfig(
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, peft_config)

    ### compare trainable parameters ###
    peft_p = print_number_of_trainable_model_parameters(model)
    print(f"# Trainable Parameters \nBefore: {ori_p} \nAfter: {peft_p} \nPercentage: {round(peft_p / ori_p * 100, 2)}")

    # output
    # trainable model parameters: 4,194,304

r: the rank of the update matrices, also called the LoRA attention dimension. A lower rank gives smaller update matrices with fewer trainable parameters; increasing r (up to 32) gives a more expressive model at the cost of higher memory consumption.

lora_alpha: controls the LoRA scaling factor.

target_modules: a list of module names, such as "q_proj" and "v_proj", that LoRA is applied to. The exact module names can vary between base models.

bias: whether the bias parameters should be trained; one of "none", "all", or "lora_only".

Printing the LoRA adapter's parameters shows they amount to less than 2% of the original model; the arithmetic is sketched below.
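The 4,194,304 figure can be reproduced by hand. Assuming Llama-2-7b's hidden size of 4096 and 32 decoder layers, each targeted module (q_proj and v_proj) gets a LoRA A matrix of shape r x 4096 and a B matrix of shape 4096 x r:

    # Reproduce the LoRA trainable-parameter count for r=8 on q_proj and v_proj.
    # Assumes Llama-2-7b dimensions: hidden size 4096, 32 decoder layers.
    hidden, r, n_layers, n_targets = 4096, 8, 32, 2    # 2 targets: q_proj, v_proj

    params_per_module = hidden * r + r * hidden        # B (4096x8) + A (8x4096)
    print(params_per_module * n_targets * n_layers)    # 4194304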

Before fine-tuning LLaMA 2, let's look at what the base model generates out of the box:

    ### generate ###
    prompt = "Write me a poem about Singapore."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generate_ids = model.generate(inputs.input_ids, max_length=64)
    print('\nAnswer: ', tokenizer.decode(generate_ids[0]))
    res = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    print(res)

When asked to write a poem about Singapore, the output is fairly vague and repetitive, suggesting the base model struggles to give a coherent, meaningful response.

Loading the fine-tuning data

For convenience, we use the open-source databricks/databricks-dolly-15k dataset, whose records look like this:

    {
        'instruction': 'Why can camels survive for long without water?',
        'context': '',
        'response': 'Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.',
        'category': 'open_qa',
    }

Prompt construction is key to getting the most out of an LLM. A typical prompt has three fields: Instruction, Input (optional), and Response. Since Input is optional, we define two templates: prompt_input for samples with an Input and prompt_no_input for samples without one:

    max_length = 256
    dataset = datasets.load_dataset(
        "databricks/databricks-dolly-15k", split='train'
    )

    ### generate prompt based on template ###
    prompt_template = {
        "prompt_input": (
            "Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request."
            "\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
        ),

        "prompt_no_input": (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request."
            "\n\n### Instruction:\n{instruction}\n\n### Response:\n"
        ),

        "response_split": "### Response:"
    }

    def generate_prompt(instruction, input=None, label=None, prompt_template=prompt_template):
        if input:
            res = prompt_template["prompt_input"].format(
                instruction=instruction, input=input)
        else:
            res = prompt_template["prompt_no_input"].format(
                instruction=instruction)
        if label:
            res = f"{res}{label}"
        return res

The generate_prompt function concatenates instruction, context, and response; the result is then tokenized into input_ids and attention_mask. So that the model learns to predict the next token, labels are created as a copy of input_ids, which the model shifts right internally; the prompt portion of labels is masked so that loss is only computed on the response.

    def tokenize(tokenizer, prompt, max_length=max_length, add_eos_token=False):
        result = tokenizer(
            prompt,
            truncation=True,
            max_length=max_length,
            padding=False,
            return_tensors=None)
        # labels start as a copy of input_ids; the model shifts them internally
        result["labels"] = result["input_ids"].copy()
        return result

    def generate_and_tokenize_prompt(data_point):
        full_prompt = generate_prompt(
            data_point["instruction"],
            data_point["context"],
            data_point["response"],
        )
        tokenized_full_prompt = tokenize(tokenizer, full_prompt)
        user_prompt = generate_prompt(data_point["instruction"], data_point["context"])
        tokenized_user_prompt = tokenize(tokenizer, user_prompt)
        user_prompt_len = len(tokenized_user_prompt["input_ids"])
        # mask the prompt tokens with -100 so loss is only computed on the response
        mask_token = [-100] * user_prompt_len
        tokenized_full_prompt["labels"] = mask_token + tokenized_full_prompt["labels"][user_prompt_len:]
        return tokenized_full_prompt


    dataset = dataset.train_test_split(test_size=1000, shuffle=True, seed=42)
    cols = ["instruction", "context", "response", "category"]
    train_data = dataset["train"].shuffle().map(generate_and_tokenize_prompt, remove_columns=cols)
    val_data = dataset["test"].shuffle().map(generate_and_tokenize_prompt, remove_columns=cols)
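As a quick sanity check (not part of the original pipeline), you can inspect one processed sample and confirm that the prompt portion of labels is masked with -100 while the response tokens are kept:

    # quick sanity check: the prompt part of `labels` should be -100, the response kept
    sample = train_data[0]
    masked_len = sum(1 for t in sample["labels"] if t == -100)
    print("masked prompt tokens:", masked_len)
    print("supervised response:", tokenizer.decode(sample["input_ids"][masked_len:]))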

Model training

    args = TrainingArguments(
        output_dir="./llama-7b-int4-dolly",
        num_train_epochs=20,
        max_steps=200,
        fp16=True,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="constant",
        per_device_train_batch_size=micro_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        gradient_checkpointing=True,
        group_by_length=False,
        logging_steps=10,
        save_strategy="epoch",
        save_total_limit=3,
        disable_tqdm=False,
    )

    trainer = Trainer(
        model=model,
        train_dataset=train_data,
        eval_dataset=val_data,
        args=args,
        data_collator=DataCollatorForSeq2Seq(
            tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True),
    )

    # silence the warnings. re-enable for inference!
    model.config.use_cache = False
    trainer.train()
    model.save_pretrained("llama-7b-int4-dolly")

Model testing

After a few hours of training, we load the pretrained Llama-2-7b-hf base model together with the LoRA weights and test again with "Write me a poem about Singapore.":

    # model path and weight
    model_id = "NousResearch/Llama-2-7b-hf"
    peft_path = "./llama-7b-int4-dolly"

    # load the 4-bit base model
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        use_cache=False,
        device_map="auto"
    )

    # load the LoRA adapter weights
    model = PeftModel.from_pretrained(
        model,
        peft_path,
        torch_dtype=torch.float16,
    )
    model.eval()

    # generation config
    generation_config = GenerationConfig(
        temperature=0.1,
        top_p=0.75,
        top_k=40,
        num_beams=4,  # beam search
    )

    # generate a reply
    with torch.no_grad():
        prompt = "Write me a poem about Singapore."
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        generation_output = model.generate(
            input_ids=inputs.input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=64,
        )
        print('\nAnswer: ', tokenizer.decode(generation_output.sequences[0]))
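The code above keeps the adapter separate and applies it at inference time. If you want a standalone checkpoint with the LoRA weights folded into the base model, PEFT's merge_and_unload can do that; a rough sketch follows (merging into a 4-bit quantized base is not supported, so the base model is reloaded in fp16 here, and the output directory name is just an example):

    # optional: fold the LoRA weights into the base model for standalone deployment
    base = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto")
    merged = PeftModel.from_pretrained(base, peft_path).merge_and_unload()
    merged.save_pretrained("llama-7b-dolly-merged")     # hypothetical output dir
    tokenizer.save_pretrained("llama-7b-dolly-merged")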

The meaning of the generation parameters temperature, top_k, top_p, and num_beams is explained in more detail at: https://github.com/ArronAI007/Awesome-AGI/blob/main/LLM%E4%B9%8BGenerate%E4%B8%AD%E5%8F%82%E6%95%B0%E8%A7%A3%E8%AF%BB.ipynb
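As a rough illustration of what these parameters do (a toy sketch over a hand-written logits vector, not the exact Hugging Face implementation): temperature rescales the logits before softmax, top_k keeps only the k most probable tokens, top_p keeps the smallest set of tokens whose cumulative probability reaches p, and num_beams instead switches from sampling to beam search.

    import torch

    logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])    # toy next-token logits

    # temperature: values < 1 sharpen the distribution, values > 1 flatten it
    probs = torch.softmax(logits / 0.1, dim=-1)

    # top_k: zero out everything except the k most probable tokens
    kth = torch.topk(probs, k=3).values[-1]
    probs_topk = torch.where(probs >= kth, probs, torch.zeros_like(probs))

    # top_p (nucleus): keep the smallest prefix of sorted tokens covering prob mass p
    sorted_p, idx = probs.sort(descending=True)
    keep = sorted_p.cumsum(0) - sorted_p < 0.75          # first token always kept
    probs_topp = torch.zeros_like(probs).scatter(0, idx[keep], sorted_p[keep])

    print(probs, probs_topk / probs_topk.sum(), probs_topp / probs_topp.sum())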

References:

                        [1] https://ai.plainenglish.io/fine-tuning-llama2-0-with-qloras-single-gpu-magic-1b6a6679d436

                        [2] https://github.com/ChanCheeKean/DataScience/blob/main/13%20-%20NLP/E04%20-%20Parameter%20Efficient%20Fine%20Tuning%20(PEFT).ipynb

