Build an ElevenLabs Clone: PyTorch, Next.js 15, AWS, Inngest, FastAPI, React, Tailwind (2025)

Andreas Trolle • March 26, 2025

Andreas Trolle

About

Hi, my name is Andreas and I'm a Software Engineer from Denmark. I build complex and interesting full-stack projects from scratch and break them down for you to learn. My goal is to teach you coding to the best of my abilities. I have a bachelor's degree in SE and 5 years of industry experience.

Latest Posts

Build an AI Music Generation SaaS: Python, Next.js, AWS, Polar, Tailwind, TS, Modal, Inngest (2025)

Andreas Trolle

Build an AI Podcast Clipping SaaS: Python, Next.js, AWS, Stripe, Tailwind, TS, Modal, Inngest (2025)

Andreas Trolle

Build and Deploy a Biotech AI Tool: Python, Next.js 15, React, Tailwind, Modal, Typescript (2025)

Andreas Trolle

(2/2) Build an ElevenLabs Clone: PyTorch, Next.js 15, AWS, Inngest, FastAPI, React, Tailwind (2025)

Andreas Trolle

Video Description

Source code: https://github.com/Andreaswt/elevenlabs-clone
Part 2: https://www.youtube.com/watch?v=9kkcaPiNjHU
Discord & More: https://andreastrolle.com
Inngest: https://innge.st/yt-andreas-1

Hi 🤙 In this video, you'll build a full-stack ElevenLabs clone with text-to-speech, voice conversion, and audio generation. Some tutorials would just call an API like ElevenLabs', but not us! Instead of external API services, you'll self-host three AI models (StyleTTS2, Seed-VC, and Make-An-Audio) from GitHub, fine-tune them to specific voices, then containerize them with Docker and expose inference endpoints via FastAPI. The AI backend will be built using Python and PyTorch.

You'll create a Next.js application where users can use the AI models to generate audio, and also switch between voices and view previously generated audio files, stored in an S3 bucket. The project includes user authentication, a credit system, and an Inngest queue to prevent overloading of the server hosting the AI models. The web application is built on the T3 Stack with Next.js, React, Tailwind, and Auth.js. Follow along for the entire process from development to deployment.

Features
🔊 Text-to-speech synthesis with StyleTTS2
🎭 Voice conversion with Seed-VC
🎵 Audio generation from text with Make-An-Audio
🤖 Custom voice fine-tuning capabilities
🐳 Docker containerization of AI models
🚀 FastAPI backend endpoints
🔄 Inngest queue to prevent server overload
📊 User credit management system
💾 AWS S3 for audio file storage
👥 Multiple pre-trained voice models
📱 Responsive Next.js web interface
🔐 User authentication with Auth.js
🎛️ Voice picker
📝 Generated audio history
🎨 Modern UI with Tailwind CSS

💲 Costs + How to follow along for free
The total fine-tuning cost for both models is ~5-10 USD. When deploying the endpoint it’s ~1 USD per hour of uptime. S3 is really cheap. IAM roles, users etc are free.
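To make the cost figures above concrete, here is a minimal arithmetic sketch. It only combines the two numbers quoted in the description (roughly 5-10 USD for fine-tuning and roughly 1 USD per hour of endpoint uptime). The function name `estimate_cost_usd` and the 3-hour example are illustrative assumptions, not something from the video, and actual AWS pricing depends on instance type and region.

```python
def estimate_cost_usd(
    fine_tune_usd: float,
    uptime_hours: float,
    hourly_rate_usd: float = 1.0,  # assumption: the ~1 USD/hour quoted above
) -> float:
    """Rough total: one-off fine-tuning cost plus hourly endpoint uptime."""
    if fine_tune_usd < 0 or uptime_hours < 0 or hourly_rate_usd < 0:
        raise ValueError("costs and hours must be non-negative")
    return fine_tune_usd + uptime_hours * hourly_rate_usd


# Hypothetical scenario: fine-tune once, keep the endpoint up for 3 hours.
low = estimate_cost_usd(5.0, 3.0)    # lower bound of the quoted range
high = estimate_cost_usd(10.0, 3.0)  # upper bound of the quoted range
print(f"Estimated total for 3 hours of uptime: {low:.2f}-{high:.2f} USD")
# prints "Estimated total for 3 hours of uptime: 8.00-13.00 USD"
```

S3 storage and IAM are left out of the estimate because the description treats them as negligible or free.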
Following along for free:
- When building the Next.js application in part 2 of the video, I create a mock endpoint, which means you don't have to host the AI models on EC2 unless you want to learn how. You can just use that mock endpoint throughout the video.
- Don't fine-tune the models; just use the model files (.pth) made by the researchers, as I also do before fine-tuning.
- Don't create EC2 instances. They are the main cost driver.
- S3 buckets are required for the voice-to-voice feature. If you want this feature, stay within the 5GB free tier. See more under storage and S3 here: https://aws.amazon.com/free
- You can of course still follow the video, learn the concepts, and code along. You can also test the Docker containers for training locally, without training the model, to learn as much as possible without the actual fine-tuning.

📖 Chapters
00:00:00 Demo
00:03:45 Theory and Plan
00:46:34 Python installation
00:47:50 TTS w. StyleTTS2
01:28:40 Preparing dataset
01:45:03 Fine-tune preparation
02:22:28 AWS setup
02:32:25 EC2 fine-tuning
02:49:21 API for TTS
03:49:29 Voice-changer w. seed-vc
04:02:51 Seed-vc fine-tuning
04:12:14 Seed-vc API
04:52:42 Text-to-SFX w. make-an-audio
05:00:21 Text-to-SFX API
05:18:19 Docker-compose
05:21:05 Deploying AIs to EC2
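The description mentions a credit system and a queue that keeps the model server from being overloaded. The sketch below illustrates that idea using only the Python standard library. It is not Inngest's API and not the code from the video: `CreditLedger`, `fake_inference`, `run_jobs`, and the "one credit per character" pricing are all hypothetical names and assumptions chosen for illustration. An `asyncio.Semaphore` stands in for the queue's concurrency limit, and a stub stands in for the real model call.

```python
import asyncio
from typing import Dict, List, Optional, Tuple


class InsufficientCredits(Exception):
    """Raised when a user cannot afford a generation request."""


class CreditLedger:
    """In-memory credit balances (hypothetical; a real app would use a database)."""

    def __init__(self, balances: Dict[str, int]) -> None:
        self._balances = dict(balances)

    def balance(self, user_id: str) -> int:
        return self._balances.get(user_id, 0)

    def charge(self, user_id: str, cost: int) -> None:
        current = self.balance(user_id)
        if current < cost:
            raise InsufficientCredits(f"{user_id} has {current}, needs {cost}")
        self._balances[user_id] = current - cost


async def fake_inference(text: str) -> str:
    """Stand-in for a call to the model server (assumption: not a real model)."""
    await asyncio.sleep(0.01)
    return f"audio-for:{text}"


async def run_jobs(
    ledger: CreditLedger,
    jobs: List[Tuple[str, str]],
    max_concurrent: int = 1,
) -> Tuple[List[Optional[str]], int]:
    """Charge credits, then run jobs with at most max_concurrent in flight.

    Returns the results in job order (None for rejected jobs) and the
    peak number of jobs that were running at the same time.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def run_one(user_id: str, text: str) -> Optional[str]:
        nonlocal running, peak
        try:
            # Assumed pricing for illustration: one credit per character.
            ledger.charge(user_id, len(text))
        except InsufficientCredits:
            return None
        async with semaphore:  # waits here while the server is busy
            running += 1
            peak = max(peak, running)
            try:
                return await fake_inference(text)
            finally:
                running -= 1

    results = await asyncio.gather(*(run_one(u, t) for u, t in jobs))
    return list(results), peak


ledger = CreditLedger({"alice": 10, "bob": 2})
jobs = [("alice", "hello"), ("bob", "hello"), ("alice", "hi")]
results, peak = asyncio.run(run_jobs(ledger, jobs, max_concurrent=1))
print(results)  # prints ['audio-for:hello', None, 'audio-for:hi']
print(peak)     # prints 1
```

With `max_concurrent=1`, requests are served one at a time no matter how many arrive together, which is the overload protection the description attributes to the queue. A hosted queue would typically add persistence and retries on top of this, which the sketch does not attempt.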

