I Made Three AIs Watch My Videos So You Don't Have To - One Actually Paid Attention

Gemini watches your videos and gets them; ChatGPT needs Codex to hold its hand; Claude charges $100/month to say 'I can't do that.' A tale of three AIs, one silent drone test, and some truly questionable thumbnails.

Let's be honest: most of us have better things to do than watch our own old YouTube videos. Fortunately, artificial intelligence is here to handle that existential dread for us. In a recent test, I subjected ChatGPT, Claude, and Gemini to the grueling task of actually understanding what happens in a video - both from YouTube links and local files. The results ranged from "impressively perceptive" to "I'm sorry, I can't do that, Dave."

I fed each AI three videos: a YouTube explainer about the scientific process of annealing (yes, I'm that exciting), a silent MP4 of me gesturing at a DJI Neo 2 drone, and a 1.65GB MOV file of me walking and talking about my YouTube posting strategy - no metadata, no transcripts, just pure, unadulterated me. The prompt was refreshingly simple: "Can you watch this video?" Because apparently asking them to "understand" or "summarize" just sends them hunting for metadata like digital raccoons.

Let's get the bad news out of the way first. Claude - whether on the app or web interface - was a polite but firm brick wall. It informed me, in so many words, that it cannot watch video content directly, cannot process visual or audio frames, and generally has all the video-watching capability of a toaster. Claude Max, at $100 per month, apparently buys you a very well-spoken refusal.

Gemini, on the other hand, was the overachiever in the class. The web interface handled everything I threw at it - YouTube URLs, a 625MB MP4, and that massive 1.65GB MOV file - right in a browser tab, no app required. The most impressive demonstration was the silent drone test video, which contains no audio and no context other than me standing in a yard waving my arms. Gemini not only figured out I was testing hand gestures for drone control but correctly deduced the drone was acting as the camera and was therefore invisible in the footage. I'm betting a fair number of humans - including, let's be honest, my neighbors - wouldn't have clocked that. It also successfully parsed my annealing video, identifying sections and specific verbal points, and understood the walk-and-talk well enough to note location and commentary topics.

Where Gemini stumbled was in the transition from video understanding to image generation. When I asked it to create a new YouTube thumbnail based on the video content and my existing style, it decided to invent a bearded man (not me, sadly) and spelled "FIRE" as "FCIRE." So close, and yet so far from thumbnail glory.

Then there's ChatGPT, which is a classic good-news-bad-news situation. The bad news: ChatGPT itself couldn't read YouTube links, and while it can in theory process uploaded videos, they need to be under 500MB. Mine, of course, were not. The good news: pair it with OpenAI's Codex agent, and things get interesting. Codex read both local files, correctly identifying the drone test as "a backyard drone test shot." For the walk-and-talk MOV, it initially balked, then politely asked permission to write Python code and install libraries for audio transcription. Once it did that, it understood the context perfectly. When Codex couldn't watch the YouTube stream directly, I asked it to download the video locally - and it automagically wrote a Python script, installed libraries, and invented impromptu video-downloading technology on the fly.

Creating a thumbnail required me to play go-between for Codex and ChatGPT. Codex chose a frame and wrote a prompt; ChatGPT generated the image. The result was better than Gemini's - it used my actual face and picked up on my color scheme (white, yellow, black) - but it made the aluminum bar into square tubing instead of flat material, placed Sharpie marks at wrong angles, and gave the bend a criminally sharp right angle. A few corrective prompts got it closer, though I still prefer doing thumbnails by hand.

Notable takeaways: both Gemini and the ChatGPT/Codex duo interpreted videos in about two to three minutes each - far less than the actual 15-minute runtime. Both correctly understood the silent drone test, which is genuinely impressive. Practical uses abound: Gemini time-stamped key thoughts in a CBS report about the OpenAI trial so I could click through, and I can definitely see using these tools to scan security footage or extract major points from long videos.

Overall, Gemini wins the solo act, while ChatGPT needs its Codex sidekick. Claude? Well, Claude is still great for vibe coding, which is apparently a thing now. Isn't it amusing that Gemini - named for twins - needs only one tool, while ChatGPT needs two? The universe has a sense of humor.
