
On the order of hundreds to low-digit thousands worked well for us. These photos contain a lot of occluders like tourists, and we needed to have enough views of the subject in question to build a good 3D scene representation.



can you elaborate on the key variables for the data? for instance, is it safe to assume 360 photos from the same angle would yield a worse model than 1 photo from 360 different angles?

what does the ideal minimal data set look like (e.g., 5 photos at each 15-degree offset)?

thanks for being so active on this thread.


NeRF's (and all of photogrammetry's) bread and butter is 3D consistency -- that is, seeing the same object from multiple angles. A 360-degree photo from a fixed position just won't do. As to how to select the best camera angles...I'm not sure. I believe there is research in this area for classical photogrammetry techniques, but I'm not familiar enough to point you to a body of work.
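
To make the multi-view requirement concrete, here is a minimal sketch (my own illustration, not anything from the paper) of why different camera positions matter: the same 3D point projects to different pixels in each view, and that disparity is what carries depth information. Rotating in place from a single spot gives no disparity at all.

    import numpy as np

    def project(point_w, K, R, t):
        """Project a 3D world point through a pinhole camera (K intrinsics, [R|t] extrinsics)."""
        p_cam = R @ point_w + t      # world -> camera coordinates
        p_img = K @ p_cam            # camera coordinates -> homogeneous image coordinates
        return p_img[:2] / p_img[2]  # perspective divide -> pixel coordinates

    # One landmark point observed by two cameras one metre apart.
    X = np.array([0.0, 0.0, 5.0])
    K = np.array([[800.0,   0.0, 320.0],
                  [  0.0, 800.0, 240.0],
                  [  0.0,   0.0,   1.0]])

    uv1 = project(X, K, np.eye(3), np.zeros(3))                 # camera at the origin
    uv2 = project(X, K, np.eye(3), np.array([-1.0, 0.0, 0.0]))  # camera shifted 1 m to the right
    print(uv1, uv2)  # ~160 px of disparity between the two views is what encodes depth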


How do you remove tourists? Is the network trained to segment and ignore humans?


The model does not explicitly learn to segment images. The answer is unfortunately more involved than an HN comment can convey. I encourage you to read the paper for more details.

https://arxiv.org/abs/2008.02268
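
For anyone skimming rather than reading: my rough take-away from the paper is that it renders a static field plus a per-image transient field, and a learned per-ray uncertainty down-weights pixels explained by transient content like tourists. A simplified sketch of that kind of loss, with variable names and constants of my own choosing rather than the authors' code:

    import numpy as np

    def uncertainty_weighted_loss(c_gt, c_static, c_transient, beta, sigma_transient,
                                  beta_min=0.03, lambda_u=0.01):
        """Hedged sketch of a NeRF-W-style per-ray loss (arXiv:2008.02268),
        not the authors' implementation.

        c_gt, c_static, c_transient: (N, 3) ground-truth / static / transient colors
        beta:            (N,)   per-ray uncertainty predicted alongside the transient field
        sigma_transient: (N, K) transient densities sampled along each ray
        """
        beta = beta + beta_min                        # keep the variance bounded away from zero
        c_pred = c_static + c_transient               # composite of the two fields
        recon = ((c_gt - c_pred) ** 2).sum(axis=-1) / (2.0 * beta ** 2)
        log_term = np.log(beta)                       # stops beta from growing without bound
        sparsity = lambda_u * sigma_transient.mean(axis=-1)  # discourage the transient field
                                                             # from explaining everything
        return (recon + log_term + sparsity).mean()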


Just gotta say: amazing!

My follow-up question would be: are you able to compare your results against actual photogrammetry data to see how well your reconstruction performs?


I'm actually quite new to the field, and I'm not even sure what to compare against nor how to compare it. What's typically measured and how?


Is the model able to capture the underlying geometry? E.g., if I have a pillar, part of which was not visible from any training viewpoint, is it able to reconstruct that part?


The model is trained to reconstruct what is observed, but not what is obscured. If you look closely at our videos, you'll notice some parts of the scene are blurry -- those parts weren't seen often enough to learn well. If you look at parts of the scene not observed at all, I'm not sure what you'd find.


would a sufficiently long video in motion, say from a drone, car or even a walking person, work instead?


Pictures are pictures, even as video frames :)
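
If anyone wants to try it, the obvious route is to sample frames from the video and feed them into the same photo pipeline. A minimal sketch with OpenCV, where the paths and sampling stride are placeholders:

    import os
    import cv2  # pip install opencv-python

    def extract_frames(video_path, out_dir, every_n=30):
        """Dump every Nth frame of a video so it can be treated as a set of photos."""
        os.makedirs(out_dir, exist_ok=True)
        cap = cv2.VideoCapture(video_path)
        idx = saved = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_n == 0:  # roughly 1 frame per second for 30 fps footage
                cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
                saved += 1
            idx += 1
        cap.release()
        return saved

    # extract_frames("walkaround.mp4", "frames", every_n=30)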


Did you consider using movies as a source too?


Consider? Yes. Try? Nope!


Awww! I figured dolly shots and Steadicam shots would fit perfectly into something like that. Especially at 24 frames per second and with mostly known locations. Of course, it would probably bias a lot of the net towards that particular time spot, I guess?


There are problems associated with using video: motion blur, rolling shutter.


Oh I agree. In my head it seems like it should work. I could be wildly wrong though. I am every day :)


It definitely can work, and even has some additional benefits (1), but requires special considerations. You can deblur using global motion vectors (2), or additional hardware like accelerometer readings embedded in the video feed (3).

1) can't find the paper now, but by exploiting predictable rolling shutter you get additional temporal resolution

2) http://users.ece.northwestern.edu/~sda690/MfB/Motion_CVPR08....

3) http://neelj.com/projects/imudeblurring/imu_deblurring.pdf
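
As a toy illustration of the idea behind (2) and (3): once you have an estimate of the blur kernel (from motion vectors or IMU readings), a standard deconvolution recovers much of the detail. A sketch with scikit-image, where the kernel length and angle are made-up values standing in for the estimated motion:

    import numpy as np
    from scipy.ndimage import convolve
    from skimage import data, restoration

    def linear_motion_psf(length=15, angle_deg=0.0, size=31):
        """Build a simple linear-motion blur kernel: a normalized line segment."""
        psf = np.zeros((size, size))
        c = size // 2
        theta = np.deg2rad(angle_deg)
        for s in np.linspace(-length / 2.0, length / 2.0, length * 4):
            row = int(round(c + s * np.sin(theta)))
            col = int(round(c + s * np.cos(theta)))
            psf[row, col] = 1.0
        return psf / psf.sum()

    # Simulate a horizontally motion-blurred image, then deblur it with Wiener deconvolution.
    image = data.camera() / 255.0
    psf = linear_motion_psf(length=15, angle_deg=0.0)
    blurred = convolve(image, psf)
    deblurred = restoration.wiener(blurred, psf, balance=0.01)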



