Neural City

3D modeling is tedious work. Creating the volumes, texturing the surfaces, and laying out and testing the lights are all time-consuming, and rendering can eat up hours just for a single frame. What if it didn’t have to be that way?

I think machine learning has something to offer here, specifically a process called pix2pix. In layman’s terms, pix2pix looks at an image and produces a new image, pixel by pixel, based on how it was trained. For example, a pix2pix model can be trained to produce a picture of a flower after being fed a line drawing of a flower. This probably exists already. The way to train said model would be to show it thousands of line drawings of flowers and thousands of corresponding photos of flowers. The more drawings and photos, the more accurate it becomes. Perhaps this could work with some low poly renderings instead of line drawings.
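To make the "corresponding" part concrete, here is a minimal sketch of how drawings and photos might be matched into training pairs. It assumes the two sets share file names, which is only an assumption for illustration; the folder names, file names, and the `pair_by_stem` helper are all hypothetical.

```python
from pathlib import PurePath


def pair_by_stem(inputs, targets):
    """Match input and target files that share a stem (file name without extension)."""
    targets_by_stem = {PurePath(t).stem: t for t in targets}
    pairs = []
    for path in sorted(inputs):
        stem = PurePath(path).stem
        # Inputs with no matching target are skipped: the model needs both halves.
        if stem in targets_by_stem:
            pairs.append((path, targets_by_stem[stem]))
    return pairs


# Hypothetical file names, for illustration only.
drawings = ["drawings/rose.png", "drawings/tulip.png", "drawings/daisy.png"]
photos = ["photos/rose.jpg", "photos/tulip.jpg"]

pairs = pair_by_stem(drawings, photos)
print(pairs)
```

The daisy drawing is dropped because it has no photo to learn from; every training example has to be a complete input/output pair.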

You can try this! This model turns line drawings into cats.

NVIDIA has been developing a couple of different pix2pix implementations that do this at higher resolutions and over video. They have also been using a particularly impressive dataset called the Cityscapes dataset which can be accessed by anyone for free. This particular dataset has thousands of images pulled from dashcam footage of various locations in Germany. What makes this dataset especially powerful is the categorization images that accompany all of the photos.

Each entity in the image has been categorized in the Cityscapes dataset. Seen here is the labeling image overlaid on the actual photo.
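In a label image, every category is painted in one flat color. As a rough sketch of what that means for anyone drawing their own inputs, the snippet below lists a handful of the Cityscapes label colors and snaps an arbitrary RGB value to the closest one. The `nearest_class` and `snap_image` helpers are hypothetical names, and nearest-color matching is just one plausible way to clean up a hand-drawn image, not something the dataset or pix2pix does for you.

```python
# A subset of the Cityscapes label colors as RGB tuples.
CITYSCAPES_COLORS = {
    "road": (128, 64, 128),
    "sidewalk": (244, 35, 232),
    "building": (70, 70, 70),
    "vegetation": (107, 142, 35),
    "sky": (70, 130, 180),
    "person": (220, 20, 60),
    "car": (0, 0, 142),
    "traffic sign": (220, 220, 0),
}


def nearest_class(rgb):
    """Return the class whose label color is closest to rgb (squared RGB distance)."""
    def dist2(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, CITYSCAPES_COLORS[name]))
    return min(CITYSCAPES_COLORS, key=dist2)


def snap_image(pixels):
    """Replace each pixel in a 2D list of RGB tuples with the closest label color."""
    return [[CITYSCAPES_COLORS[nearest_class(p)] for p in row] for row in pixels]


# Slightly "off" colors, like the ones you might pick by eye in a paint program.
print(nearest_class((130, 60, 130)))   # close to the road color
print(nearest_class((100, 150, 40)))   # close to the vegetation color
print(snap_image([[(0, 0, 255), (75, 75, 75)]]))
```

A model trained on exact label colors has never seen in-between shades, so snapping hand-picked colors to the palette is a cheap way to keep inputs closer to what it expects.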

These pix2pix models learn to synthesize photorealistic images based entirely on the label images, and they are getting some impressive results. I was able to replicate those results by using their algorithm and training my own model with the same dataset.

This is the test image I used. It’s a typical label image.
This is the resulting output image. It’s low resolution (I did this on my PC at home), but at a glance, it looks realistic!

Once I trained my model, which took about 24 hours to get decent results, I wanted to test its limits. What was it capable of? What were its weaknesses? Not much and a lot, apparently. I thought that if it built the correct associations of objects to colors, I could just output all kinds of colorful images from MS Paint and everything would look right, right? Wrong. The image above looks great, but that’s because the label image it came from was part of the training set. In other words, if you were taking a math test and saw that one of the questions was straight from the homework, you’d know the answer because you remembered it from the homework, not necessarily because you know how to do the math.
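One way to avoid fooling yourself like this is to set some label images aside before training and only test on those. Here is a minimal sketch of that idea; the `split_holdout` helper, the 10% fraction, and the file names are all assumptions for illustration, not part of the pix2pix code I used.

```python
import random


def split_holdout(items, holdout_fraction=0.1, seed=0):
    """Shuffle items reproducibly and return (training items, held-out items)."""
    shuffled = sorted(items)
    random.Random(seed).shuffle(shuffled)
    # Always keep at least one item out of training.
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical label image names, for illustration only.
label_images = [f"label_{i:04d}.png" for i in range(100)]
train, held_out = split_holdout(label_images)

print(len(train), len(held_out))
# No held-out image should ever appear in the training list.
print(set(train).isdisjoint(held_out))
```

An output generated from a held-out label image is the equivalent of a test question that wasn't in the homework.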

Test 1

This input image was an attempt to make an image that looks like something the model is expecting. It was made using Neubauwelt’s invaluable vector graphics collection.
The resulting image looks pretty good! The tai chi lady looks like a yeti, but everything looks kind of like it’s in the real world.

The results were very satisfying. It understands that it’s an image of a tree-lined street full of people with a couple of cars parked in very precarious positions. Let’s push it a bit further.

Test 2

What if we got rid of the cars and just had people standing around? … and just a little patch of grass in the middle of the road?
Some of these people are looking less like people, but the roads have shadows and the trees have leaves.

Test 3

I forgot to add signs. Let’s add some signs.


More signs = more weird.


Test 4

Maybe everyone can just be standing in a park.
This looks completely unrecognizable. The input image has clearly confused the model. Maybe it’s because there’s no road in it.

Test 5

Does it need everything to look like a road receding into the distance? What about an elevation?
Not as bad as the park, but also not great. Trees look good, people look good and the cars also look okay. The building, on the other hand, looks terrible!

After seeing that there were some serious limitations to my model, I thought making a video would probably be the easiest way to see what it did well and what it did poorly. I learned that videos are also something it does poorly! I knew going in that vid2vid might have been a better choice for making a video, but I didn’t have the computing power/time to mess with that one.

I very quickly modeled a scene in 3ds Max, output a series of short clips, and put together this video.

So it’s far from perfect. And each frame is treated as a completely new frame, so people walking and cars driving look like they’re flickering TV sets. But I have to say, this is still pretty incredible. Once I trained the model and figured out how to use it, the 3D modeling took less than an hour and the rendering took maybe an hour, despite the fact that it was outputting flat color block images. Feeding the frames to the model ended up taking less than a minute!
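The flicker can be put into rough numbers. Because the model generates every frame from scratch, neighboring frames can differ a lot even when the scene barely moves. The sketch below scores that with the average per-pixel difference between consecutive frames; it is a simple measure I'm assuming for illustration, and the helper names and the tiny 2x2 grayscale "frames" are made up.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute per-pixel difference between two equally sized grayscale frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def flicker_scores(frames):
    """Score each pair of consecutive frames; higher means a bigger jump."""
    return [mean_abs_diff(a, b) for a, b in zip(frames, frames[1:])]


# Made-up 2x2 grayscale frames: a small change, then a big jump.
frames = [
    [[10, 10], [10, 10]],
    [[12, 10], [10, 10]],
    [[200, 0], [90, 10]],
]
scores = flicker_scores(frames)
print(scores)
```

Comparing these scores for the flat-color input clips against the scores for the generated clips would show how much of the frame-to-frame change comes from the model rather than from actual motion.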

I can see a not-so-distant future where texturing, lighting and rendering will be handled by machine learning algorithms, with people focusing primarily on the design work. Until then, however, doing it manually is still preferable.
