Phi-2 vs. Mixtral 8x7B: Don't Believe the Benchmarks


With new contenders emerging every other day, two recent models have generated significant buzz: Phi-2 from Microsoft and Mixtral 8x7B from Mistral AI. Both boast impressive capabilities across a range of tasks, but which one comes out on top?

In the Red Corner: Phi-2

Phi-2 is a 2.7 billion parameter model that punches above its weight class. Despite its relatively small size, it has outperformed much larger models on several benchmarks, including:

Reasoning: Phi-2 excels at tasks that require logical thinking and common sense, such as solving puzzles and answering open-ended questions.

Language Understanding: Phi-2 demonstrates a deep understanding of natural language, allowing it to generate coherent and grammatically correct text, even in complex contexts.

Mathematics: Phi-2 can handle basic mathematical operations and solve simple word problems.

Coding: Phi-2 can understand and generate basic code, making it a valuable tool for programmers.

One of Phi-2's key strengths is its training data. Microsoft curated a dataset specifically designed to teach the model common sense reasoning across various domains. This targeted approach has resulted in a model that is not only knowledgeable but also capable of applying its knowledge to solve real-world problems.

Context Size: 2,048 tokens.

Phi-2 requires less than 10GB of VRAM on a GPU, or under 16GB of system RAM when running on the CPU.
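For readers who want to try it, here is a minimal sketch of running Phi-2 locally with Hugging Face Transformers. It assumes the microsoft/phi-2 checkpoint and the transformers, torch and accelerate packages; treat it as a starting point rather than a reference implementation.

```python
# Minimal sketch: local inference with Phi-2 via Hugging Face Transformers.
# Assumes the "microsoft/phi-2" checkpoint; uses a CUDA GPU if one is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device_map="auto",  # place the model on GPU if available, otherwise CPU
)

prompt = "Instruct: Explain why the sky is blue in one sentence.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```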

However, take these benchmark results with a grain of salt: in our own tests, the outputs did not live up to the numbers.

In the Blue Corner: Mixtral 8x7B

Mixtral 8x7B is a much larger model: a sparse mixture-of-experts network with eight 7-billion-parameter expert blocks per layer, for a total of roughly 47 billion parameters (of which only about 13 billion are active for any given token). It is known for its:

Speed: Because only a fraction of its parameters are active per token, Mixtral 8x7B runs inference faster than dense models of comparable quality, making it well suited to real-time applications.

Efficiency: The model is also relatively efficient in terms of computational resources, making it more accessible to a wider range of users.

Open-source: Mixtral 8x7B is released under the Apache 2.0 license, allowing for greater transparency and customization.

One of Mixtral 8x7B's defining features is its "mixture of experts" architecture: for each token, a small router network selects two of the eight expert feed-forward blocks in every layer and blends their outputs. This gives the model the capacity of many experts while paying the compute cost of only a few.
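To make the idea concrete, here is a toy sketch of top-2 expert routing in PyTorch. It is a simplified illustration of the general technique, not Mixtral's actual implementation; the dimensions and module names are made up for the example.

```python
# Toy sketch of top-2 mixture-of-experts routing (illustrative only, not Mixtral's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # scores each token for each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        logits = self.router(x)
        weights, chosen = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(ToyMoELayer()(tokens).shape)  # torch.Size([10, 64])
```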

Context Size: 32,000 tokens.

The model can run inference using roughly 40GB of VRAM.
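As an illustration of what that looks like in practice, here is a sketch that loads Mixtral with 4-bit quantization through Hugging Face Transformers and bitsandbytes, one common way to squeeze the model onto a single large GPU. The mistralai/Mixtral-8x7B-Instruct-v0.1 checkpoint is assumed, and actual memory use will vary with quantization settings and context length.

```python
# Sketch: loading Mixtral 8x7B with 4-bit quantization to reduce VRAM requirements.
# Assumes the "mistralai/Mixtral-8x7B-Instruct-v0.1" checkpoint plus the
# transformers, accelerate and bitsandbytes packages.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across available GPUs / CPU as needed
)

messages = [{"role": "user", "content": "Summarize the benefits of mixture-of-experts models."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
output_ids = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```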


So, who wins?

The answer, as always, depends on your specific needs and priorities. If you are looking for a small, efficient model that excels at reasoning and language understanding, then Phi-2 is a great choice. If speed and open-source accessibility are your top concerns, then Mixtral 8x7B might be a better fit.

Phi-2 is a better fit for text summarization and can serve as a base model to fine-tune on more specialized, domain-expert data. Its limited context size, however, makes it a poor choice for summarizing long documents or for writing large sections of code.
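If you do go the fine-tuning route with Phi-2, a lightweight option is a parameter-efficient method such as LoRA. The sketch below attaches LoRA adapters with the peft library; the target module names are an assumption based on the current Phi architecture in Transformers and may need adjusting for other model revisions.

```python
# Sketch: attaching LoRA adapters to Phi-2 for parameter-efficient fine-tuning.
# Target module names are assumptions; check the model's architecture before relying on them.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
)

lora_config = LoraConfig(
    r=16,                      # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights are trainable
# ...train on your own dataset with Trainer or trl's SFTTrainer...
```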

Mixtral 8x7B's published benchmarks put it close to GPT-3.5 in performance; however, its size makes it difficult to run locally and challenging to fine-tune.

Our favorite model remains Mistral 7B Instruct v0.2, which is a good middle ground between Mixtral 8x7B and Phi-2.


