Papers
DeepSeek-V3 · DeepSeek · 2024-12 · 197 bears
DeepSeek's mixture-of-experts large language model with 671B total and 37B active parameters per token. The report details training-efficiency innovations including FP8 mixed-precision training and the Multi-head Latent Attention (MLA) mechanism, which compresses the key-value cache.
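The KV-cache saving behind MLA is easiest to see in code: instead of caching full per-head keys and values, each token caches one small latent vector that is up-projected into keys and values at attention time. A minimal PyTorch sketch under assumed dimensions; the names (`d_model`, `d_latent`, the projection matrices) are illustrative, and the real design adds decoupled rotary-position heads, omitted here:

```python
import torch
import torch.nn.functional as F

d_model, d_latent, n_heads, d_head = 512, 64, 8, 64

# Illustrative projections, not DeepSeek's actual configuration.
W_dkv = torch.randn(d_model, d_latent) / d_model**0.5            # down-project to latent
W_uk  = torch.randn(d_latent, n_heads * d_head) / d_latent**0.5  # latent -> keys
W_uv  = torch.randn(d_latent, n_heads * d_head) / d_latent**0.5  # latent -> values
W_q   = torch.randn(d_model, n_heads * d_head) / d_model**0.5    # hidden -> queries

def mla(h):                      # h: (seq, d_model)
    seq = h.shape[0]
    c_kv = h @ W_dkv             # (seq, d_latent): this small latent is what
                                 # gets cached, instead of full K and V
    q = (h @ W_q).view(seq, n_heads, d_head).transpose(0, 1)
    k = (c_kv @ W_uk).view(seq, n_heads, d_head).transpose(0, 1)
    v = (c_kv @ W_uv).view(seq, n_heads, d_head).transpose(0, 1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(0, 1).reshape(seq, n_heads * d_head)

print(mla(torch.randn(16, d_model)).shape)  # torch.Size([16, 512])
```

The cache per token shrinks from `2 * n_heads * d_head` values to `d_latent` values, which in this toy configuration is a 16x reduction.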
Qwen2.5 · Alibaba (Qwen team) · 2024-12 · 45 bears
The Qwen team's release covering open-weights models from 0.5B to 72B parameters, with specialized variants for coding (Qwen2.5-Coder) and mathematics (Qwen2.5-Math). The report describes pre-training data curation and post-training techniques.
Llama 3 · Meta AI / FAIR · 2024-07 · 557 bears
Meta's open-weights large-language-model release covering models from 8B to 405B parameters. The technical report runs to nearly a hundred pages and credits hundreds of contributors across pre-training, post-training, safety, evaluation, and infrastructure work.
Apple Foundation · Apple AFM team · 2024-07 · 154 bears
Apple's foundation-model report introducing AFM-on-device and AFM-server, the models powering Apple Intelligence. The report covers pre-training, post-training, and on-device deployment, with attention to privacy-preserving techniques.
Gemma 2 · Google DeepMind · 2024-06 · 197 bears
Google DeepMind's open-weights model family at 2B, 9B, and 27B parameters, built on the research behind Gemini, with the smaller models trained by knowledge distillation from a larger teacher. The report covers architectural choices including local–global attention interleaving and logit soft-capping.
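Logit soft-capping is a one-line transform, so a sketch may help: logits pass through a scaled tanh, staying roughly linear near zero but never exceeding a fixed cap, which keeps extreme logits from destabilizing training. A minimal PyTorch sketch; the cap values 50.0 (attention logits) and 30.0 (final logits) are those given for Gemma 2, and the example inputs are illustrative:

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Roughly identity for |logits| << cap, smoothly bounded in (-cap, cap).
    return cap * torch.tanh(logits / cap)

x = torch.tensor([-200.0, -20.0, 0.0, 20.0, 200.0])
print(soft_cap(x, 50.0))  # attention-logit cap reported for Gemma 2
print(soft_cap(x, 30.0))  # final-logit cap reported for Gemma 2
```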
Gemini 1.5 · Google DeepMind · 2024-02 · 911 bears
Google DeepMind's multimodal long-context model family, with a production context window of up to one million tokens and research results reported out to ten million. The technical report describes the architecture, training, and evaluation across text, code, audio, image, and video.
Mixtral 8x7B · Mistral AI · 2024-01 · 26 bears
Mistral's sparse mixture-of-experts model with eight feed-forward expert blocks per layer and two selected per token, giving roughly 13B active out of 47B total parameters. The report introduced an open-weights MoE architecture that delivers strong performance at low active-parameter cost.
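The routing step is compact enough to sketch. The following PyTorch toy, with illustrative sizes and plain linear experts standing in for Mixtral's SwiGLU blocks, shows the core mechanics: a linear gate scores all eight experts, the top two are kept, their scores are renormalized with a softmax, and each token's output is the weighted sum of its two experts:

```python
import torch
import torch.nn.functional as F

n_experts, top_k, d = 8, 2, 16
gate = torch.nn.Linear(d, n_experts, bias=False)
experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(n_experts))

def moe_layer(x):                        # x: (tokens, d)
    scores = gate(x)                     # (tokens, n_experts)
    w, idx = scores.topk(top_k, dim=-1)  # keep the two best experts per token
    w = F.softmax(w, dim=-1)             # renormalize over the chosen pair
    out = torch.zeros_like(x)
    for k in range(top_k):               # weighted sum of the selected experts
        for e in range(n_experts):
            mask = idx[:, k] == e        # tokens whose k-th choice is expert e
            if mask.any():
                out[mask] += w[mask, k, None] * experts[e](x[mask])
    return out

print(moe_layer(torch.randn(4, d)).shape)  # torch.Size([4, 16])
```

Only the two selected experts run per token, which is why the model's compute cost tracks its ~13B active parameters rather than its 47B total.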