WizardLM Reviews: Empowering Large Pre-Trained Language Models to Follow Complex Instructions
About WizardLM
WizardLM is a family of large language models trained to follow complex instructions across domains such as general conversation, coding, and math. The models are trained with a novel method called Evol-Instruct, which automatically rewrites seed instructions into progressively more challenging ones to improve performance. Key features of the WizardLM models include multi-turn conversation, strong accuracy on code tasks such as HumanEval, and stronger mathematical reasoning than other open-source models of comparable size.
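As a rough illustration of how Evol-Instruct works, here is a minimal sketch of one "in-depth evolving" loop, assuming an OpenAI-compatible client as the rewriting model. The prompt wording and the `evolve` helper are illustrative stand-ins, not the paper's exact templates:

```python
# Minimal sketch of Evol-Instruct "in-depth evolving".
# The prompt below is illustrative; the paper uses several of its own
# templates (add constraints, deepen, concretize, increase reasoning, etc.).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EVOLVE_PROMPT = """Rewrite the following instruction into a more complex
version that a human can still understand and respond to.
Add one more constraint or requirement, and keep it self-contained.

#Instruction#
{instruction}

#Rewritten Instruction#"""

def evolve(instruction: str, rounds: int = 3) -> list[str]:
    """Evolve an instruction for several rounds, returning every version."""
    versions = [instruction]
    for _ in range(rounds):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # the paper used ChatGPT as the evolver
            messages=[{"role": "user",
                       "content": EVOLVE_PROMPT.format(instruction=versions[-1])}],
        )
        versions.append(resp.choices[0].message.content.strip())
    return versions

if __name__ == "__main__":
    for i, v in enumerate(evolve("Write a function that reverses a string.")):
        print(f"round {i}: {v}\n")
```

Each evolved version (after filtering out failed rewrites) becomes new fine-tuning data, which is how the training set grows harder over successive generations.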
The GitHub repo provides model checkpoints, demos, and documentation for WizardLM, WizardCoder, and WizardMath models – ranging from 1B to 70B parameters.
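The checkpoints are published on Hugging Face, so a released model can be loaded with the transformers library. A minimal sketch, assuming the `WizardLM/WizardLM-13B-V1.2` repo id and the Vicuna-style prompt format documented for the V1.1+ models (verify both against the repo's README):

```python
# Load a released WizardLM checkpoint and run one prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "WizardLM/WizardLM-13B-V1.2"  # check the repo for the exact id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Vicuna-style prompt format used by WizardLM V1.1+ per the project docs.
prompt = ("A chat between a curious user and an artificial intelligence "
          "assistant. The assistant gives helpful, detailed, and polite "
          "answers to the user's questions. "
          "USER: Explain what pass@1 measures. ASSISTANT:")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```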
What is WizardCoder?
In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code.
WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions
WizardCoder
We released WizardCoder-Python-34B-V1.0, which achieves 73.2 pass@1 and surpasses GPT-4 (2023/03/15), ChatGPT-3.5, and Claude 2 on the HumanEval benchmark. For more details, please refer to WizardCoder.
WizardMath
Our WizardMath-70B-V1.0 model slightly outperforms some closed-source LLMs on GSM8K, including ChatGPT-3.5, Claude Instant 1, and PaLM 2 540B.
WizardLM
We released the WizardLM-70B-V1.0 model.
GPT-4 automatic evaluation
We adopt the GPT-4-based automatic evaluation framework proposed by FastChat to assess the performance of chatbot models. As shown in the following figure, WizardLM-30B achieves better results than Guanaco-65B.
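For context, this framework asks GPT-4 to score two answers to the same question. The sketch below is an illustrative pairwise judge in that style; the prompt template and `judge` helper are stand-ins for FastChat's own (which also swap answer order to reduce position bias):

```python
# Illustrative GPT-4 pairwise judging in the style of FastChat's llm_judge.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a helpful and precise assistant for checking the
quality of two AI answers to the same question.

[Question]
{question}

[Answer 1]
{answer_a}

[Answer 2]
{answer_b}

Rate each answer on a scale of 1 to 10 and explain briefly.
Output the two scores on the first line, separated by a space."""

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 to score two candidate answers against each other."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep the scoring as deterministic as possible
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return resp.choices[0].message.content
```

Averaging such scores over a fixed question set yields the relative percentages reported in this kind of evaluation.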
WizardLM-30B performance on different skills.
The following figure compares WizardLM-30B and ChatGPT on the skills of the Evol-Instruct test set. The results indicate that WizardLM-30B achieves 97.8% of ChatGPT's performance on average, reaching or exceeding ChatGPT's capacity on 18 skills and more than 90% of it on 24 skills.
WizardLM performance on NLP foundation tasks.
The following table compares WizardLMs and other LLMs on NLP foundation tasks. The results indicate that the WizardLMs consistently outperform LLaMA models of the same size. Furthermore, our WizardLM-30B model achieves performance comparable to OpenAI's text-davinci-003 on the MMLU and HellaSwag benchmarks.
| Model | MMLU 5-shot | ARC 25-shot | TruthfulQA 0-shot | HellaSwag 10-shot | Average |
|---|---|---|---|---|---|
| Text-davinci-003 | 56.9 | 85.2 | 59.3 | 82.2 | 70.9 |
| Vicuna-13B 1.1 | 51.3 | 53.0 | 51.8 | 80.1 | 59.1 |
| Guanaco-30B | 57.6 | 63.7 | 50.7 | 85.1 | 64.3 |
| WizardLM-7B 1.0 | 42.7 | 51.6 | 44.7 | 77.7 | 54.2 |
| WizardLM-13B 1.0 | 52.3 | 57.2 | 50.5 | 81.0 | 60.2 |
| WizardLM-30B 1.0 | 58.8 | 62.5 | 52.4 | 83.3 | 64.2 |
WizardLM performance on code generation.
The following table compares WizardLMs and several other LLMs on code generation, namely the HumanEval benchmark. The evaluation metric is pass@1 (a sketch of the estimator follows the table). The results indicate that the WizardLMs consistently outperform LLaMA models of the same size. Furthermore, our WizardLM-30B model surpasses StarCoder and OpenAI's code-cushman-001. Moreover, our Code LLM, WizardCoder, demonstrates exceptional performance, achieving a pass@1 score of 57.3 and surpassing the previous open-source SOTA by roughly 20 points.
| Model | HumanEval pass@1 |
|---|---|
| LLaMA-7B | 10.5 |
| LLaMA-13B | 15.8 |
| CodeGen-16B-Multi | 18.3 |
| CodeGeeX | 22.9 |
| LLaMA-33B | 21.7 |
| LLaMA-65B | 23.7 |
| PaLM-540B | 26.2 |
| CodeGen-16B-Mono | 29.3 |
| code-cushman-001 | 33.5 |
| StarCoder | 33.6 |
| WizardLM-7B 1.0 | 19.1 |
| WizardLM-13B 1.0 | 24.0 |
| WizardLM-30B 1.0 | 37.8 |
| WizardCoder-15B 1.0 | 57.3 |
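For reference, pass@1 (and pass@k generally) is typically computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021): sample n completions per problem, count the c that pass the unit tests, and average 1 - C(n-c, k) / C(n, k) over problems. A self-contained sketch, with made-up sample counts:

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn from n, of which
    c pass the tests) is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example for one problem: 200 samples generated, 45 pass the unit tests.
print(f"pass@1  = {pass_at_k(200, 45, 1):.3f}")   # 0.225
print(f"pass@10 = {pass_at_k(200, 45, 10):.3f}")
```

The benchmark score is this quantity averaged over all 164 HumanEval problems.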
Tools similar to WizardLM
Bing Chat
Langfuse
LlamaIndex
Stable Beluga 2