3D modeling is tedious work. Creating the volumes, texturing the surfaces, and laying out and testing the lights are all time-consuming, and rendering can eat up hours for a single frame. What if it didn’t have to be that way?
I think machine learning has something to offer here, specifically a process called pix2pix. In layman’s terms, pix2pix looks at an image and produces a new image, pixel by pixel, based on how it was trained. For example, a pix2pix model can be trained to produce a picture of a flower after being fed a line drawing of a flower. This probably exists already. The way to train said model would be to show it thousands of line drawings of flowers and thousands of corresponding photos of flowers. The more drawings and photos, the more accurate it becomes. Perhaps this could work with low-poly renderings instead of line drawings.
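
If that sounds abstract, here’s a minimal sketch of the idea in PyTorch. It is nowhere near the real pix2pix architecture (which uses a U-Net generator and a PatchGAN discriminator); it only shows how paired examples push a generator toward the right output:

```python
# Minimal sketch of pix2pix-style paired training (PyTorch).
# NOT the real pix2pix networks -- just the core idea: learn a mapping from an
# input image (line drawing / label map) to an output image (photo) from pairs.
import torch
import torch.nn as nn

# Toy generator: image in, same-sized image out.
generator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, kernel_size=3, padding=1), nn.Tanh(),
)

# Toy discriminator: judges whether an (input, output) pair looks real.
discriminator = nn.Sequential(
    nn.Conv2d(6, 64, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, kernel_size=4, stride=2, padding=1),
)

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def train_step(drawing, photo):
    """One step on a paired batch of (B, 3, H, W) tensors scaled to [-1, 1]."""
    fake_photo = generator(drawing)

    # Discriminator: real pairs should score high, generated pairs low.
    d_real = discriminator(torch.cat([drawing, photo], dim=1))
    d_fake = discriminator(torch.cat([drawing, fake_photo.detach()], dim=1))
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: fool the discriminator while staying close to the real photo (L1).
    d_fake = discriminator(torch.cat([drawing, fake_photo], dim=1))
    g_loss = bce(d_fake, torch.ones_like(d_fake)) + 100.0 * l1(fake_photo, photo)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()

# drawing, photo = next(iter(paired_loader))  # hypothetical DataLoader of image pairs
```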

NVIDIA has been developing a couple of pix2pix implementations, pix2pixHD and vid2vid, that do this at higher resolutions and over video. They have also been using a particularly impressive dataset, the Cityscapes dataset, which anyone can access for free. It contains thousands of images pulled from dashcam footage of various locations in Germany. What makes this dataset especially powerful is the categorization images that accompany every photo: semantic label maps in which each object class is painted in a single flat color.
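
To make that concrete: each categorization image is just flat blocks of color, one color per object class. Here is a small sketch in Python, using a handful of palette values that, to the best of my recollection, match the standard Cityscapes convention (treat them as illustrative rather than authoritative):

```python
# A few of the flat colors the Cityscapes label images use, and a helper that
# paints a crude label map by hand. Palette values are the standard Cityscapes
# colors as I recall them -- double-check against the official scripts.
from PIL import Image

CITYSCAPES_COLORS = {
    "road":       (128,  64, 128),
    "sidewalk":   (244,  35, 232),
    "building":   ( 70,  70,  70),
    "vegetation": (107, 142,  35),
    "sky":        ( 70, 130, 180),
    "person":     (220,  20,  60),
    "car":        (  0,   0, 142),
}

def paint_label_map(width=512, height=256):
    """Paint a toy label map: sky on top, road on the bottom, one car blob."""
    img = Image.new("RGB", (width, height), CITYSCAPES_COLORS["sky"])
    pixels = img.load()
    for y in range(height // 2, height):                 # lower half becomes road
        for x in range(width):
            pixels[x, y] = CITYSCAPES_COLORS["road"]
    for y in range(height // 2 - 20, height // 2 + 20):  # a rectangular "car"
        for x in range(width // 3, width // 3 + 80):
            pixels[x, y] = CITYSCAPES_COLORS["car"]
    return img

# paint_label_map().save("hand_made_label.png")
```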

These pix2pix implementations have been using the dataset to synthesize photorealistic images based entirely on the label images, with some impressive results. I was able to replicate their results by using their algorithm and training my own model on the same dataset.
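
For anyone who wants to try the same thing, the training pairs come together roughly like the sketch below, assuming the usual Cityscapes folder layout (leftImg8bit/ for the photos, gtFine/ for the label images). The official pix2pix and pix2pixHD repos ship their own data loaders, so this is only meant to show the pairing:

```python
# Sketch of pairing Cityscapes label images with their photos, assuming the
# standard directory layout. Not the loader the official repos actually use.
from pathlib import Path

def paired_files(cityscapes_root, split="train"):
    """Yield (label_image_path, photo_path) pairs for one split."""
    root = Path(cityscapes_root)
    for photo in sorted((root / "leftImg8bit" / split).rglob("*_leftImg8bit.png")):
        label = (root / "gtFine" / split / photo.parent.name /
                 photo.name.replace("_leftImg8bit.png", "_gtFine_color.png"))
        if label.exists():
            yield label, photo

# for label_path, photo_path in paired_files("./datasets/cityscapes"):
#     ...  # load both, crop/resize identically, feed as (input, target) to training
```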


Once I trained my model, which took about 24 hours to get decent results, I wanted to test its limits. What was it capable of? What were its weaknesses? Not much and a lot, apparently. I thought that if it had built the correct associations between objects and colors, I could just output all kinds of colorful images from MS Paint and everything would look right, right? Wrong. The image above looks great, but that’s because the label image it came from was part of the training set. In other words, if you were taking a math test and saw that one of the questions was also in the homework, you’d know the answer from the homework, not necessarily because you know how to do math.
Test 1


The results were very satisfying. It understands that it’s an image of a tree-lined street full of people with a couple of cars parked in very precarious positions. Let’s push it a bit further.
Test 2


Test 3


Test 4


Test 5


After seeing that there were some serious limitations to my model, I thought making a video would probably be the easiest way to see what it did well and what it did poorly. I learned that video is also something it does poorly! I knew going in that vid2vid might have been a better choice for making a video, but I didn’t have the computing power or the time to mess with it.
I quickly modeled a scene in 3ds Max, output a series of short clips, and put them together into this video.
So it’s far from perfect. Each frame is treated as a completely new image, so people walking and cars driving look like they’re on flickering TV sets. But I have to say, this is still pretty incredible. Once I had trained the model and figured out how to use it, the 3D modeling took less than an hour and the rendering took maybe another hour, even though it was only outputting flat color-block images. Feeding the frames to the model ended up taking less than a minute!
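
Conceptually, feeding the frames to the model amounts to a loop like the one below: every rendered frame goes through the generator on its own, with no memory of the previous frame, which is exactly where the flicker comes from. The paths and the load_trained_generator helper are hypothetical placeholders, not the actual scripts I ran:

```python
# Frame-by-frame inference sketch: each flat-color render is translated
# independently, so nothing enforces consistency between consecutive frames.
from pathlib import Path

import torch
from PIL import Image
from torchvision import transforms

to_tensor = transforms.Compose([
    transforms.Resize((256, 512)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # scale to [-1, 1]
])

generator = load_trained_generator("checkpoints/cityscapes_model.pth")  # hypothetical helper
generator.eval()

Path("output").mkdir(exist_ok=True)
with torch.no_grad():
    for frame_path in sorted(Path("renders/flat_color_frames").glob("*.png")):
        label = to_tensor(Image.open(frame_path).convert("RGB")).unsqueeze(0)
        fake = generator(label)[0]              # (3, H, W) in [-1, 1]
        out = ((fake + 1) / 2).clamp(0, 1)      # back to [0, 1] for saving
        transforms.ToPILImage()(out).save(f"output/{frame_path.name}")
```
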
I can see a not-so-distant future where texturing, lighting, and rendering are handled by machine learning algorithms, with people focusing primarily on the design work. Until then, however, doing it manually is still preferable.