Papers
DeepSeek-V3 · DeepSeek · 2024-12 · 197 bears
DeepSeek's mixture-of-experts large language model with 671B total and 37B active parameters per token. The report details training-efficiency innovations including FP8 mixed-precision training and the Multi-head Latent Attention (MLA) mechanism, which compresses the key-value cache.
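The KV-cache saving behind MLA is easiest to see in code: instead of caching full per-head keys and values, each token caches one small latent vector that is up-projected into keys and values at attention time. A minimal PyTorch sketch under assumed dimensions; the names (`d_model`, `d_latent`, the projection matrices) are illustrative, and the real design adds decoupled rotary-position heads, omitted here:

```python
import torch
import torch.nn.functional as F

d_model, d_latent, n_heads, d_head = 512, 64, 8, 64

# Illustrative projections, not DeepSeek's actual configuration.
W_dkv = torch.randn(d_model, d_latent) / d_model**0.5            # down-project to latent
W_uk  = torch.randn(d_latent, n_heads * d_head) / d_latent**0.5  # latent -> keys
W_uv  = torch.randn(d_latent, n_heads * d_head) / d_latent**0.5  # latent -> values
W_q   = torch.randn(d_model, n_heads * d_head) / d_model**0.5    # hidden -> queries

def mla(h):                      # h: (seq, d_model)
    seq = h.shape[0]
    c_kv = h @ W_dkv             # (seq, d_latent): this small latent is what
                                 # gets cached, instead of full K and V
    q = (h @ W_q).view(seq, n_heads, d_head).transpose(0, 1)
    k = (c_kv @ W_uk).view(seq, n_heads, d_head).transpose(0, 1)
    v = (c_kv @ W_uv).view(seq, n_heads, d_head).transpose(0, 1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(0, 1).reshape(seq, n_heads * d_head)

print(mla(torch.randn(16, d_model)).shape)  # torch.Size([16, 512])
```

The cache per token shrinks from `2 * n_heads * d_head` values to `d_latent` values, which in this toy configuration is a 16x reduction.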
Qwen2.5 · Alibaba (Qwen team) · 2024-12 · 45 bears
The Qwen team's release covering open-weights models from 0.5B to 72B parameters, with specialized variants for coding (Qwen2.5-Coder) and mathematics (Qwen2.5-Math). The report describes pre-training data curation and post-training techniques.
Llama 3 · Meta AI / FAIR · 2024-07 · 557 bears
Meta's open-weights large-language-model release covering models from 8B to 405B parameters. The technical report runs to nearly a hundred pages and credits hundreds of contributors across pre-training, post-training, safety, evaluation, and infrastructure work.
Apple Foundation · Apple AFM team · 2024-07 · 154 bears
Apple's foundation-model report introducing AFM-on-device and AFM-server, the models powering Apple Intelligence. The report covers pre-training, post-training, and on-device deployment, with attention to privacy-preserving techniques.
Gemma 2 · Google DeepMind · 2024-06 · 197 bears
Google DeepMind's open-weights model family at 2B, 9B, and 27B parameters, built on the research behind Gemini, with the smaller models trained by knowledge distillation from a larger teacher. The report covers architectural choices including local–global attention interleaving and logit soft-capping.
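Logit soft-capping is a one-line transform, so a sketch may help: logits pass through a scaled tanh, staying roughly linear near zero but never exceeding a fixed cap, which keeps extreme logits from destabilizing training. A minimal PyTorch sketch; the cap values 50.0 (attention logits) and 30.0 (final logits) are those given for Gemma 2, and the example inputs are illustrative:

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Roughly identity for |logits| << cap, smoothly bounded in (-cap, cap).
    return cap * torch.tanh(logits / cap)

x = torch.tensor([-200.0, -20.0, 0.0, 20.0, 200.0])
print(soft_cap(x, 50.0))  # attention-logit cap reported for Gemma 2
print(soft_cap(x, 30.0))  # final-logit cap reported for Gemma 2
```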
Gemini 1.5 · Google DeepMind · 2024-02 · 911 bears
Google DeepMind's multimodal long-context model family, with a production context window of up to one million tokens and research results reported out to ten million. The technical report describes the architecture, training, and evaluation across text, code, audio, image, and video.
Mixtral 8x7B · Mistral AI · 2024-01 · 26 bears
Mistral's sparse mixture-of-experts model with eight feed-forward expert blocks per layer and two selected per token, giving roughly 13B active out of 47B total parameters. The report introduced an open-weights MoE architecture that delivers strong performance at low active-parameter cost.
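The routing step is compact enough to sketch. The following PyTorch toy, with illustrative sizes and plain linear experts standing in for Mixtral's SwiGLU blocks, shows the core mechanics: a linear gate scores all eight experts, the top two are kept, their scores are renormalized with a softmax, and each token's output is the weighted sum of its two experts:

```python
import torch
import torch.nn.functional as F

n_experts, top_k, d = 8, 2, 16
gate = torch.nn.Linear(d, n_experts, bias=False)
experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(n_experts))

def moe_layer(x):                        # x: (tokens, d)
    scores = gate(x)                     # (tokens, n_experts)
    w, idx = scores.topk(top_k, dim=-1)  # keep the two best experts per token
    w = F.softmax(w, dim=-1)             # renormalize over the chosen pair
    out = torch.zeros_like(x)
    for k in range(top_k):               # weighted sum of the selected experts
        for e in range(n_experts):
            mask = idx[:, k] == e        # tokens whose k-th choice is expert e
            if mask.any():
                out[mask] += w[mask, k, None] * experts[e](x[mask])
    return out

print(moe_layer(torch.randn(4, d)).shape)  # torch.Size([4, 16])
```

Only the two selected experts run per token, which is why the model's compute cost tracks its ~13B active parameters rather than its 47B total.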