WizardLM Reviews: Empowering Large Pre-Trained Language Models to Follow Complex Instructions

About WizardLM

WizardLM is a family of large language models trained to follow complex instructions across domains such as general conversation, coding, and math. The models are trained with a method called Evol-Instruct, which automatically generates increasingly challenging instructions to use as training data. Key features of WizardLM models include multi-turn conversation, high accuracy on coding benchmarks such as HumanEval, and strong mathematical reasoning relative to other open-source models.
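To make the idea of automatically generating harder instructions more concrete, here is a loose Python sketch of an instruction-evolution loop. It is not the authors' implementation: the prompt wording, the `evolve_instructions` function, and the `toy_llm` stand-in are all hypothetical, and a real setup would call an actual language model and apply stricter filtering.

```python
import random
from typing import Callable, List, Optional

# Hypothetical rewrite prompts. The wording is illustrative only and is
# NOT taken from the WizardLM paper or repository.
EVOLVE_PROMPTS = [
    "Rewrite the instruction so that it adds one more constraint:\n{instruction}",
    "Rewrite the instruction so that it needs more reasoning steps:\n{instruction}",
    "Write a new instruction in the same domain on a rarer topic:\n{instruction}",
]


def evolve_instructions(
    seeds: List[str],
    llm: Callable[[str], str],
    rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Grow a pool of instructions by repeatedly asking `llm` to rewrite them."""
    rng = rng or random.Random(0)
    pool = list(seeds)      # everything collected so far, seeds included
    current = list(seeds)   # instructions to evolve in the next round
    for _ in range(rounds):
        evolved = []
        for instruction in current:
            prompt = rng.choice(EVOLVE_PROMPTS).format(instruction=instruction)
            candidate = llm(prompt).strip()
            # Discard failed evolutions: empty or unchanged outputs.
            if candidate and candidate != instruction:
                evolved.append(candidate)
        pool.extend(evolved)
        current = evolved
    return pool


def toy_llm(prompt: str) -> str:
    """Stand-in for a real model: echoes the instruction with a suffix."""
    return prompt.splitlines()[-1] + " (harder)"


if __name__ == "__main__":
    for item in evolve_instructions(["Sort a list of numbers."], toy_llm):
        print(item)
```

Passing the model as a plain callable keeps the loop independent of any particular API, so the same sketch can be tried with a local model or a hosted one.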

The GitHub repo provides model checkpoints, demos, and documentation for WizardLM, WizardCoder, and WizardMath models – ranging from 1B to 70B parameters.

What is WizardCoder?

In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code.

WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions

WizardCoder

We released WizardCoder-Python-34B-V1.0, which achieves 73.2 pass@1 and surpasses GPT-4 (2023/03/15), ChatGPT-3.5, and Claude 2 on the HumanEval benchmark. For more details, please refer to WizardCoder.

WizardMath

Our WizardMath-70B-V1.0 model slightly outperforms some closed-source LLMs on GSM8K, including ChatGPT 3.5, Claude Instant 1, and PaLM 2 540B.

WizardLM

We released the WizardLM-70B-V1.0 model.

GPT-4 automatic evaluation

We adopt the automatic evaluation framework based on GPT-4 proposed by FastChat to assess the performance of chatbot models. As shown in the following figure, WizardLM-30B achieved better results than Guanaco-65B.

[Figure: GPT-4 automatic evaluation results for chatbot models, including WizardLM-30B and Guanaco-65B]

WizardLM-30B performance on different skills.

The following figure compares the skills of WizardLM-30B and ChatGPT on the Evol-Instruct test set. The results indicate that WizardLM-30B achieves 97.8% of ChatGPT's performance on average, reaching roughly 100% or more of ChatGPT's capacity on 18 skills and more than 90% on 24 skills.

[Figure: per-skill comparison of WizardLM-30B and ChatGPT on the Evol-Instruct test set]

WizardLM performance on NLP foundation tasks.

The following table compares WizardLM models with other LLMs on NLP foundation tasks. The results indicate that WizardLM models consistently outperform LLaMA models of the same size. Furthermore, our WizardLM-30B model performs comparably to OpenAI's Text-davinci-003 on the MMLU and HellaSwag benchmarks.

Model              MMLU (5-shot)  ARC (25-shot)  TruthfulQA (0-shot)  HellaSwag (10-shot)  Average
Text-davinci-003   56.9           85.2           59.3                 82.2                 70.9
Vicuna-13b 1.1     51.3           53.0           51.8                 80.1                 59.1
Guanaco 30B        57.6           63.7           50.7                 85.1                 64.3
WizardLM-7B 1.0    42.7           51.6           44.7                 77.7                 54.2
WizardLM-13B 1.0   52.3           57.2           50.5                 81.0                 60.2
WizardLM-30B 1.0   58.8           62.5           52.4                 83.3                 64.2
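
The Average column in the table above appears to be the unweighted mean of the four benchmark scores. The short Python check below recomputes it from the listed values; the names `SCORES`, `mean_score`, and `max_average_gap` are illustrative and do not come from the WizardLM repository.

```python
# Scores copied from the table above, in the order:
# MMLU, ARC, TruthfulQA, HellaSwag, reported Average.
SCORES = {
    "Text-davinci-003": (56.9, 85.2, 59.3, 82.2, 70.9),
    "Vicuna-13b 1.1": (51.3, 53.0, 51.8, 80.1, 59.1),
    "Guanaco 30B": (57.6, 63.7, 50.7, 85.1, 64.3),
    "WizardLM-7B 1.0": (42.7, 51.6, 44.7, 77.7, 54.2),
    "WizardLM-13B 1.0": (52.3, 57.2, 50.5, 81.0, 60.2),
    "WizardLM-30B 1.0": (58.8, 62.5, 52.4, 83.3, 64.2),
}


def mean_score(benchmarks):
    """Unweighted mean of the per-benchmark scores."""
    return sum(benchmarks) / len(benchmarks)


def max_average_gap(scores):
    """Largest absolute gap between a recomputed mean and the reported Average."""
    return max(abs(mean_score(row[:4]) - row[4]) for row in scores.values())


if __name__ == "__main__":
    for model, row in SCORES.items():
        print(f"{model}: recomputed {mean_score(row[:4]):.2f}, reported {row[4]}")
```

Every recomputed mean falls within rounding distance of the reported value, which supports reading the column as a simple mean.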

WizardLM performance on code generation.

The following table compares WizardLM models with several other LLMs on the HumanEval code generation benchmark. The evaluation metric is pass@1. The results indicate that WizardLM models consistently outperform LLaMA models of the same size. Furthermore, our WizardLM-30B model surpasses StarCoder and OpenAI's code-cushman-001. Moreover, our Code LLM, WizardCoder-15B, achieves a pass@1 score of 57.3, surpassing the open-source SOTA by approximately 20 points.

Model                HumanEval Pass@1
LLaMA-7B             10.5
LLaMA-13B            15.8
CodeGen-16B-Multi    18.3
CodeGeeX             22.9
LLaMA-33B            21.7
LLaMA-65B            23.7
PaLM-540B            26.2
CodeGen-16B-Mono     29.3
code-cushman-001     33.5
StarCoder            33.6
WizardLM-7B 1.0      19.1
WizardLM-13B 1.0     24.0
WizardLM-30B 1.0     37.8
WizardCoder-15B 1.0  57.3
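
For readers unfamiliar with the pass@1 metric used above, the Python sketch below shows the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c is the number that pass the unit tests. This page does not say how many samples were drawn or which estimator was used for the scores above, so treat this as an assumption; the function names are illustrative.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of n generated samples passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, as a percentage.

    `results` holds one (n, c) pair per benchmark problem.
    """
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical outcome for three problems, ten samples each.
    print(benchmark_pass_at_k([(10, 10), (10, 3), (10, 0)], k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, so with a single greedy sample per problem pass@1 is simply the percentage of problems solved.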

Tools similar to WizardLM

Bing Chat
Langfuse
LlamaIndex
Stable Beluga 2
