kingoflolz / mesh-transformer-jax

The parallelism scheme is similar to the original Megatron-LM, which is efficient on TPUs due to the high-speed 2D mesh network.
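As a rough illustration (not the library's actual implementation), here is a minimal sketch of Megatron-style tensor parallelism on a named 2D device mesh using JAX's sharding API; the axis names "dp"/"mp", the toy shapes, and the mesh factorization are assumptions for the example:

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange the available devices into a 2D (data, model) mesh. On a TPU pod
# slice you would pick a factorization that maps onto the physical 2D torus,
# e.g. (8, 8) on a v3-64; here all devices go on the model axis for simplicity.
devices = np.array(jax.devices()).reshape(1, -1)
mesh = Mesh(devices, axis_names=("dp", "mp"))

# Shard a feedforward weight column-wise over the model axis (Megatron-style)
# and the activations over the data axis.
w = jax.device_put(jnp.zeros((4096, 16384)), NamedSharding(mesh, P(None, "mp")))
x = jax.device_put(jnp.ones((8, 4096)), NamedSharding(mesh, P("dp", None)))

@jax.jit
def ffn_first_matmul(x, w):
    # Each model-parallel shard computes its own slice of the output features;
    # XLA inserts any required collectives automatically.
    return x @ w

y = ffn_first_matmul(x, w)
print(y.sharding)  # output features remain sharded over the "mp" axis
```

On a 2D mesh, the second matmul of each feedforward block (and the attention output projection) would reduce the partial results across the "mp" axis, which is where the fast mesh interconnect matters.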

This library is designed for scalability up to approximately 20B parameters on TPUv3s, beyond which different parallelism strategies should be used. See other implementations such as GPT-NeoX or DeepSpeed for that.

One future direction for research is integrating this codebase with swarm-jax, to achieve further scalability with pipeline parallelism.

This project would not have been possible without compute generously provided by the TPU Research Cloud with assistance from EleutherAI.

The model consists of 28 layers with a model dimension of 4096 and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary position embeddings (RoPE) are applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
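For reference, a minimal sketch of these hyperparameters as a plain Python dict, together with a back-of-the-envelope parameter count; the key names and the untied input/output embedding assumption are mine, not the repo's config schema:

```python
cfg = dict(layers=28, d_model=4096, d_ff=16384, n_heads=16, d_head=256,
           rotary_dims=64, vocab=50257)

# Rough weight-only parameter count (biases and layer norms ignored).
attn  = 4 * cfg["d_model"] * cfg["n_heads"] * cfg["d_head"]  # q, k, v, out projections
mlp   = 2 * cfg["d_model"] * cfg["d_ff"]                     # up + down projections
embed = cfg["vocab"] * cfg["d_model"]

# Assuming an untied input embedding and output head (two vocab-sized matrices).
total = cfg["layers"] * (attn + mlp) + 2 * embed
print(f"~{total / 1e9:.2f}B parameters")  # ≈ 6.05B
```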

* represents evaluation numbers reported by their respective authors; all other numbers were obtained by running the lm-evaluation-harness either with the released weights or with API access. Due to subtle implementation differences as well as different zero-shot task framing, these might not be directly comparable. See this blog post for more details.
