Llama-3.1 Announcement
We are happy to announce that we have brought up support for Llama-3.1-70B inference on Tenstorrent’s 8-chip systems, the TT-QuietBox and the TT-LoudBox.
The source code for Llama-3.1-70B and other supported models is available on our GitHub. We have also merged support for Llama-3.1-8B, running on our single-chip n150 card.
Implementation highlights:
- Fractured with 8-way tensor parallelism (see the sketch after this list)
- Uses FlashAttention and FlashDecode
- Uses mixed BF16, BFP8, and BFP4 precision
- Performance was measured in eager mode with tracing disabled
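To make the "fractured with 8-way tensor parallelism" point concrete, here is a minimal, device-agnostic sketch that uses plain PyTorch tensors to stand in for the eight chips. The column-parallel split, the shard count, and the (deliberately shrunken) shapes are illustrative assumptions; this is not the tt-metal implementation.

```python
import torch

NUM_DEVICES = 8      # assumed: one weight shard per chip on the 8-chip system
HIDDEN = 1024        # kept small for the sketch; Llama-3.1-70B uses 8192
FFN = 4096           # kept small for the sketch; Llama-3.1-70B uses 28672

x = torch.randn(1, HIDDEN)        # activations, replicated on every "chip"
w1 = torch.randn(HIDDEN, FFN)     # full up-projection weight (reference copy)

# Column-parallel fracture: each "chip" holds 1/8 of the output columns.
w1_shards = torch.chunk(w1, NUM_DEVICES, dim=1)

# Each chip computes its slice of the output independently...
partials = [x @ shard for shard in w1_shards]

# ...and the full activation is recovered by gathering the slices back together.
y_parallel = torch.cat(partials, dim=1)

# Sanity check: the fractured compute matches the single-device matmul.
assert torch.allclose(y_parallel, x @ w1, atol=1e-4)
```

Column-parallel layers like this need no reduction at all, only a gather of the output slices; row-parallel layers trade that for an all-reduce of partial sums, and the real model interleaves both to keep communication off the critical path.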
We are working on optimizations that will get us to our target of 20 tokens/second/user. Buy our 8-chip systems (TT-QuietBox and TT-LoudBox) to try Llama-3.1-70B at home on Tenstorrent hardware!
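For reference, "tokens/second/user" is the per-user decode rate: the number of new tokens generated for a single user divided by the wall-clock time taken, independent of how many users are batched together. A small sketch of that calculation, using made-up placeholder numbers rather than measured results:

```python
def tokens_per_second_per_user(new_tokens: int, elapsed_s: float) -> float:
    """Per-user decode rate: new tokens generated for one user / wall-clock seconds."""
    return new_tokens / elapsed_s

# Placeholder example: 128 new tokens per user in 8.0 seconds works out to
# 16 tokens/second/user, still short of a 20 t/s/u target.
print(tokens_per_second_per_user(128, 8.0))  # -> 16.0
```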