
Building AI Apps with Gemini 2.5 Pro: A Practical Guide

Gemini 2.5 Pro's multimodal reasoning capabilities unlock a new class of applications. Here's how I built a production video-to-notes tool and what I learned along the way.

February 10, 2025 · 2 min read

Why Gemini Over Other LLMs?

When I started building the Videos to Notes App, I evaluated three leading models: GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro. The deciding factor wasn't raw text quality; it was native multimodal reasoning.

Gemini 2.5 Pro can reason across video frames, audio transcripts, and text in a single API call. No orchestration layer needed, no frame extraction pipeline to maintain.

import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-2.5-pro-preview" });

async function extractNotesFromVideo(videoUrl: string): Promise<string> {
  const result = await model.generateContent([
    {
      fileData: {
        mimeType: "video/mp4",
        fileUri: videoUrl,
      },
    },
    `Extract structured notes from this video. Format as:
     ## Key Topics
     ## Action Items
     ## Important Quotes
     ## Summary`,
  ]);

  return result.response.text();
}

The Architecture

The app follows a simple three-layer architecture:

  1. Client: React SPA that uploads video and streams notes
  2. API Route: Next.js Route Handler that calls Gemini
  3. Gemini: Handles multimodal processing and returns markdown
            uploads video
    Client ──────────► /api/extract ──────► Gemini 2.5 Pro
                            │                     │
                            │◄────── markdown ────┘
                            │
                        streams to UI

Streaming the Response

Long videos produce long notes. Streaming prevents the UI from feeling frozen:

// app/api/extract/route.ts
import { GoogleGenerativeAI } from "@google/generative-ai";

export async function POST(req: Request) {
  const { videoUri } = await req.json();
  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
  const model = genAI.getGenerativeModel({ model: "gemini-2.5-pro-preview" });

  // Await the streaming handle, then forward chunks to the client as they arrive
  const result = await model.generateContentStream([
    { fileData: { mimeType: "video/mp4", fileUri: videoUri } },
    "Extract and structure the key notes from this video.",
  ]);

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of result.stream) {
        controller.enqueue(encoder.encode(chunk.text()));
      }
      controller.close();
    },
  });

  return new Response(readable, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
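The post doesn't show the client half of this exchange. A minimal sketch of how a React SPA might consume the plain-text stream (the helper name `readNotesStream` and the callback shape are my assumptions, not from the app's source):

```typescript
// Hypothetical client-side helper: reads the Response body from
// fetch("/api/extract"), decoding chunks as they arrive so the UI
// can render partial notes instead of waiting for the full document.
async function readNotesStream(
  body: ReadableStream<Uint8Array>,
  onChunk: (text: string) => void
): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let notes = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    const text = decoder.decode(value, { stream: true });
    notes += text;
    onChunk(text); // e.g. append to React state for incremental rendering
  }
  return notes;
}
```

In a component, `onChunk` would typically call a state setter so markdown renders progressively while Gemini is still generating.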

Rate Limits & Cost Management

Gemini 2.5 Pro is powerful but not free at scale. A few strategies I use:

  • Cache aggressively: identical video URIs shouldn't trigger a second API call
  • Chunk large videos: break anything over 20 minutes into segments
  • Use Flash for drafts: route low-priority requests to gemini-2.5-flash (10x cheaper)
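The first strategy can be sketched in a few lines. This is a hypothetical in-memory version (the `cachedExtract` helper and its signature are my own illustration, not the app's actual cache); storing the promise rather than the resolved string also dedupes concurrent requests for the same URI:

```typescript
// Cache keyed by video URI: identical URIs reuse the first extraction
// instead of triggering a second Gemini call.
const notesCache = new Map<string, Promise<string>>();

function cachedExtract(
  videoUri: string,
  extract: (uri: string) => Promise<string>
): Promise<string> {
  const hit = notesCache.get(videoUri);
  if (hit) return hit; // cache hit: no second API call

  // Store the in-flight promise immediately, so a second request
  // arriving before the first resolves still shares one API call.
  const pending = extract(videoUri);
  notesCache.set(videoUri, pending);
  return pending;
}
```

In production you'd likely back this with Redis or a database keyed by a content hash, so the cache survives server restarts.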

What's Next

The multimodal space is evolving fast. My next exploration is real-time meeting streams: piping live audio into Gemini and surfacing action items as they happen.

If you want to explore the source code, it's open on GitHub.