I don't think it's necessarily so simple, and I think the multi-frame case blows...

I don't think it's necessarily so simple, and I think the multi-frame case blows up the complexity of the algorithm.

In the single frame case, each pixel is a 0 dimensional point, and for each pixel, you evaluate the 3 adjacent pixels in the row above. Then to find the seam you just pick the lowest energy pixel at the edge of the frame and follow the thread up. So the total runtime of this algorithm is O(pixels).

In the multi-frame case, if you want the video to be totally smooth, you have to think higher-dimensionally.

In the multi frame case, each seam is a 1 dimensional list of pixels, and for each seam, you evaluate the $HUGE_NUMBER of adjacent seams in the previous frame.

That is, the 2D case's runtime is proportional to the image height times the number of adjacent pixels, which is a small constant (3). In the 3D case, the runtime is proportional to the number of frame times the number of adjacent seams, which is massive.

I can imagine some heuristic optimizations that would allow you to guide your search, though. For instance, you could significantly downsample the video in both pixels and framerate and solve that, and then use that low-resolution solution to constrain and prune an approximate high-resolution search.