Gemma 2 rivals models twice its size: how does it fare against Llama 3 at the same scale?
In the AI race, technology giants are competing fiercely. GPT-4o had barely launched before Claude 3.5 Sonnet followed. Although Google entered this contest late, it has shown that it can catch up quickly, which speaks to its capacity for technical development and innovation. Alongside the Gemini models, Gemma, a family of lightweight, state-of-the-art (SOTA) open models, is arguably more within reach for most developers. It is built on the same research and technology as the Gemini models and aims to give everyone the tools to build AI. Google continues to expand the Gemma family with CodeGemma, RecurrentGemma, and PaliGemma. Each model offers distinct capabilities for different AI tasks and is easily accessible through partners such as Hugging Face, NVIDIA, and Ollama.
Now the Gemma family welcomes a new member, Gemma 2, which continues the tradition of small, efficient models. Gemma 2 ships in two sizes, 9 billion (9B) and 27 billion (27B) parameters. Both offer better performance and more efficient inference than the first generation, along with significant safety improvements. The 27B version is competitive with models more than twice its size, delivering performance that was previously available only from proprietary models, and it can now do so on a single NVIDIA H100 Tensor Core GPU or TPU host, which greatly reduces deployment costs.
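The single-accelerator claim can be sanity-checked with some rough arithmetic. The sketch below is only an estimate under stated assumptions that do not come from the article: a parameter count of exactly 27e9, 2 bytes per parameter (bfloat16) as the inference precision, an 80 GB accelerator, and weights only (no KV cache or activations).

```python
# Back-of-the-envelope weight memory for the 27B model.
# Assumptions (not from the article): exactly 27e9 parameters,
# bfloat16 inference (2 bytes per parameter), weights only.

def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


PARAMS_27B = 27e9
ACCELERATOR_MEMORY_GB = 80  # assumption: an 80 GB GPU such as the A100 80GB

bf16_gb = weight_memory_gb(PARAMS_27B, 2)  # 2 bytes per parameter
fp32_gb = weight_memory_gb(PARAMS_27B, 4)  # 4 bytes per parameter

print(f"bf16 weights: {bf16_gb:.0f} GB, fits: {bf16_gb < ACCELERATOR_MEMORY_GB}")
print(f"fp32 weights: {fp32_gb:.0f} GB, fits: {fp32_gb < ACCELERATOR_MEMORY_GB}")
```

Under these assumptions the 16-bit weights leave some headroom on one 80 GB device, while 32-bit weights would not fit. Real deployments also need memory for the KV cache and activations.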
The Google team built Gemma 2 on a redesigned architecture so that it delivers both strong performance and efficient inference. In brief, its standout features are performance, cost, and inference speed:
- Excellent performance: The 27B Gemma 2 model offers the best performance in its size class and is even competitive with models more than twice its size. The 9B Gemma 2 model also leads its size class, outperforming Llama 3 8B and other comparable open models.
- High efficiency, low cost: The 27B Gemma 2 model is designed to run inference efficiently at full precision on a single Google Cloud TPU host, NVIDIA A100 80GB Tensor Core GPU, or NVIDIA H100 Tensor Core GPU, significantly reducing costs while maintaining high performance. This makes AI deployment more accessible and affordable.
- Ultra-fast inference: Gemma 2 is optimized to run at high speed across a range of hardware, from powerful gaming laptops and high-end desktops to cloud-based setups. Users can try Gemma 2 at full precision in Google AI Studio, run a quantized version on the CPU with Gemma.cpp for local performance, or try it on a home computer with an NVIDIA RTX or GeForce RTX GPU via Hugging Face Transformers.
The above is a comparison of benchmark scores for Gemma 2, Llama 3, and Grok-1.
In fact, judging by the benchmark scores, the advantage of the open 9B model is not especially pronounced. GLM-4-9B, a Chinese model open-sourced by Zhipu AI nearly a month earlier, compares even more favorably.
Additionally, Gemma 2 is not only more powerful but also designed to integrate more easily into existing workflows, giving developers more ways to build and deploy AI solutions.
- Open and accessible: Like the original Gemma model, Gemma 2 allows developers and researchers to share and commercialize innovations.
- Broad framework compatibility: Gemma 2 is compatible with major AI frameworks such as Hugging Face Transformers, and with JAX, PyTorch, and TensorFlow via native Keras 3.0, as well as vLLM, Gemma.cpp, Llama.cpp, and Ollama, so it integrates easily with the tools and workflows users already prefer. In addition, Gemma is optimized with NVIDIA TensorRT-LLM to run on NVIDIA-accelerated infrastructure or as an NVIDIA NIM inference microservice, with optimization for NVIDIA's NeMo to follow. Fine-tuning is possible today with Keras and Hugging Face, and Google is actively working on further fine-tuning options.
- Easy Deployment: Starting next month, Google Cloud customers will be able to easily deploy and manage Gemma 2 on Vertex AI.
Google is also offering a new Gemma Cookbook, a collection of practical examples and guides designed to help users build their own applications and fine-tune Gemma 2 models for specific tasks. Gemma Cookbook link: https://github.com/google-gemini/gemma-cookbook
At the same time, Google delivered on features it announced at the I/O conference some time ago: access to Gemini 1.5 Pro's 2 million token context window, code execution in the Gemini API, and the addition of Gemma 2 to Google AI Studio.
- In its latest blog post, Google announced that it has opened access to Gemini 1.5 Pro's 2 million token context window to all developers. As the context window grows, however, input costs can grow too. To help developers reduce costs for tasks that reuse the same tokens across multiple prompts, Google has launched context caching in the Gemini API for Gemini 1.5 Pro and 1.5 Flash.
- Large language models can handle math and data reasoning more accurately when they are able to generate and execute code, so Google has enabled code execution in Gemini 1.5 Pro and 1.5 Flash. When it is turned on, the model can dynamically generate and run Python code and learn iteratively from the results until it arrives at the desired final output. The execution sandbox is not connected to the internet and comes with a few numerical libraries preinstalled. Developers are billed only for the model's output tokens. This is Google's first step toward offering code execution as a model capability, and it is available today through the Gemini API and in Google AI Studio under Advanced Settings.
- Google wants to make AI accessible to all developers, whether they integrate Gemini models with an API key or use the open Gemma 2 models. To help developers get hands-on with Gemma 2, the Google team is making it available for experimentation in Google AI Studio.
The following summarizes the Gemma 2 technical report, which lets us examine the technical details from several angles.
- Paper address: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf
- Blog address: https://blog.google/technology/developers/google-gemma-2/
Like the earlier Gemma models, Gemma 2 is based on a decoder-only transformer architecture. Table 1 summarizes the main parameters and architectural choices of the models. Some architectural elements are shared with the first version of Gemma, namely a context length of 8192 tokens, rotary position embeddings (RoPE), and the approximated GeGLU non-linearity. Gemma 1 and Gemma 2 also differ in several ways, including the use of deeper networks. The main differences are summarized as follows:
- Local sliding window and global attention. The research team alternates between local sliding window attention and global attention in every other layer. The sliding window size of the local attention layers is set to 4096 tokens, while the span of the global attention layers is set to 8192 tokens.
- Logit soft-capping. Following Gemini 1.5, the research team caps the logits in each attention layer and in the final layer so that their values stay between −soft_cap and +soft_cap.
- For the 9B and 27B models, the research team sets the attention logit soft cap to 50.0 and the final logit soft cap to 30.0. At the time of publication, attention logit soft-capping was incompatible with common FlashAttention implementations, so the feature was removed from libraries that use FlashAttention. The research team ran ablations on model generation with and without attention logit soft-capping and found that generation quality was almost unaffected in most pre-training and post-training evaluations. All evaluations in the paper use the full model architecture, including attention logit soft-capping, but some downstream performance may still be slightly affected by its removal.
- RMSNorm for pre-norm and post-norm. To stabilize training, the research team uses RMSNorm to normalize the input and the output of each transformer sub-layer, that is, of both the attention layer and the feed-forward layer.
- Grouped-query attention (GQA). Both the 27B and 9B models use GQA with num_groups = 2, a choice based on ablations showing faster inference with downstream performance maintained.
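Three of the choices above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the reference implementation. The soft cap uses the tanh form given in the technical report, soft_cap * tanh(logits / soft_cap). Which layers are local (even-indexed here) and the reading of num_groups = 2 as "two query heads share one key/value head" are assumptions the article does not spell out. The function names are hypothetical.

```python
import math

ATTN_SOFT_CAP = 50.0    # attention logit soft cap, from the text
FINAL_SOFT_CAP = 30.0   # final logit soft cap, from the text
SLIDING_WINDOW = 4096   # local attention window in tokens, from the text


def soft_cap(logit: float, cap: float) -> float:
    """Squash a logit smoothly so it stays between -cap and +cap."""
    return cap * math.tanh(logit / cap)


def is_local_layer(layer_idx: int) -> bool:
    """Assumption: even-indexed layers are local, odd-indexed are global."""
    return layer_idx % 2 == 0


def can_attend(layer_idx: int, query_pos: int, key_pos: int) -> bool:
    """Causal visibility rule for one query/key pair in a given layer."""
    if key_pos > query_pos:
        return False  # never attend to future tokens
    if is_local_layer(layer_idx):
        # local layers only see keys inside the sliding window
        return query_pos - key_pos < SLIDING_WINDOW
    return True  # global layers see the whole preceding context


def kv_head_for(query_head: int, queries_per_kv_head: int = 2) -> int:
    """Assumption: num_groups = 2 means two query heads share one KV head."""
    return query_head // queries_per_kv_head


if __name__ == "__main__":
    print(soft_cap(3.0, ATTN_SOFT_CAP))       # small logits pass almost unchanged
    print(soft_cap(1000.0, FINAL_SOFT_CAP))   # large logits saturate near the cap
    print(can_attend(0, 5000, 0), can_attend(1, 5000, 0))
    print([kv_head_for(h) for h in range(8)])
```

Because tanh is close to the identity near zero, typical logits are barely changed while outliers are bounded. A hard clip would zero the gradient for capped values, whereas this form keeps it smooth.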
Google gives a brief overview of the parts of pre-training that differ from Gemma 1. Gemma 2 27B was trained on 13 trillion tokens of primarily English data, the 9B model on 8 trillion tokens, and the 2.6B model on 2 trillion tokens. These tokens come from a variety of data sources, including web documents, code, and scientific articles. The models are not multimodal, nor are they trained specifically for state-of-the-art multilingual capabilities. The final data mixture was determined through ablations similar to the approach used for Gemini 1.0. The research team trained the models on TPUv4, TPUv5e, and TPUv5p; details are given in Table 3.
In post-training, Google fine-tunes the pre-trained models into instruction-tuned models:
- First, supervised fine-tuning (SFT) is applied on a mix of text-only, English-only synthetic and human-generated prompt-response pairs.
- Then, reinforcement learning from human feedback (RLHF) is applied on top of these models. The reward model is trained on labeled English-only preference data, and the policy uses the same prompts as the SFT stage.
- Finally, the models obtained at each stage are averaged to improve overall performance. The final data mixtures and post-training recipe, including tuned hyperparameters, were chosen to minimize model harms related to safety and hallucinations while increasing model helpfulness.
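The averaging step can be pictured with a toy example. The sketch below assumes plain uniform averaging of parameters that share a name. The article does not say which weighting scheme Google used, and the `Checkpoint` type, the function name, and the tiny example checkpoints are all hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of weights.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Uniformly average parameters that share a name across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        # zip(*...) walks the same weight position across all checkpoints
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


# Toy checkpoints standing in for models from different stages.
checkpoint_a = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
checkpoint_b = {"layer.weight": [2.0, 4.0], "layer.bias": [3.0]}

merged = average_checkpoints([checkpoint_a, checkpoint_b])
print(merged)
```

Averaging in weight space only makes sense when the checkpoints share an architecture and a common starting point, which holds for models fine-tuned from the same pre-trained base.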
Gemma 2 models are fine-tuned with a different formatting schema from Gemma 1 models, although Google uses the same control tokens, as described in Table 4; an example conversation is provided in Table 5.
Experiments and Evaluation
Table 6 shows that distilling from a larger model improves performance compared with training from scratch. Note that 500B tokens is 10 times the compute-optimal number of tokens for a 2.6B model. The research team distilled from a 7B model to keep a ratio similar to that of distilling from the 27B model into the 9B model.
In Table 7, the Google team measures the impact of distillation as model size increases, and the gain persists as the model grows. In this ablation, the research team kept the teacher model size at 7B and trained smaller models to simulate the gap between the final teacher and student model sizes.
In addition, Google accounted for the impact of changes in prompt and evaluation formatting by measuring the variance of performance on MMLU, as shown in Table 11. The Gemma 2B model is slightly less robust to formatting than the larger models. Notably, Mistral 7B is significantly less robust than the Gemma models.
The research team also evaluated the 27B model trained on 13 trillion tokens (without distillation) and compared it on the HuggingFace evaluation suite with the similarly sized Qwen1.5 34B and the 2.5 times larger LLaMA-3 70B; the results are listed in Table 12. Models were selected based on their ranking on the HuggingFace leaderboard. Overall, the Gemma-2 27B model performs best in its size category and is even competitive with larger models that take longer to train.
The Gemma-2 27B and 9B instruction-tuned models were evaluated blind in Chatbot Arena by human raters against other SOTA models.
The research team reports the Elo scores in Figure 1. In addition, the research team evaluated the multi-turn conversation capabilities of the Gemma 1.1 7B, Gemma 2 9B, and Gemma 2 27B models by having human raters converse with the models while following specified scenarios. Google used a diverse held-out set of 500 scenarios, each describing a sequence of requests to the model, such as brainstorming, making a plan, or learning something new. The average number of user turns was 8.4. Users rated conversations with the Gemma 2 models significantly higher than those with Gemma 1.1 on both user satisfaction and conversation goal achievement (see Table 15). The Gemma 2 models were also better than Gemma 1.1 7B at maintaining high-quality responses from the start of a conversation through later turns. For more details, please read the original paper.
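As a rough check on the distillation ablation described above, the token budget and the teacher-to-student size ratios can be worked out directly. The sketch assumes the common rule of thumb of about 20 training tokens per parameter for compute-optimal training. The article does not state which rule the report used, so the constant and the helper names are assumptions.

```python
# Assumption (not from the article): compute-optimal training uses
# roughly 20 tokens per parameter.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(num_params: float) -> float:
    """Estimated compute-optimal token count under the assumed rule of thumb."""
    return num_params * TOKENS_PER_PARAM


def size_ratio(teacher_params: float, student_params: float) -> float:
    """How many times larger the teacher is than the student."""
    return teacher_params / student_params


optimal_tokens = compute_optimal_tokens(2.6e9)   # student in the ablation
budget_multiple = 500e9 / optimal_tokens         # 500B tokens used in the ablation

ablation_ratio = size_ratio(7e9, 2.6e9)          # 7B teacher, 2.6B student
final_ratio = size_ratio(27e9, 9e9)              # 27B teacher, 9B student

print(f"compute-optimal tokens for 2.6B: {optimal_tokens / 1e9:.0f}B")
print(f"500B tokens is about {budget_multiple:.1f}x that budget")
print(f"teacher/student ratio: ablation {ablation_ratio:.2f}, final {final_ratio:.2f}")
```

Under this assumption the 500B-token budget comes out a little under ten times the compute-optimal amount, and the 7B-to-2.6B pairing has a size ratio close to that of the 27B-to-9B pairing, which matches the reasoning given for the ablation setup.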