Frankie Willard, Alexander Kumar, Caroline Tang
Recent advances in the ability of object detection models to accurately identify objects in real time have facilitated new developments in computer vision, including in autonomous vehicles, facial recognition, and medical imaging. Given the abundance of aerial and satellite imagery available, object detection can be applied to overhead imagery to quickly and automatically map out regions, providing a geographical distribution of specified objects of interest. However, object detection models often require an abundance of diverse, high-quality, manually labeled imagery, which is expensive to filter and label. Thus, we generated synthetic images and their labels using image blending to reduce the costs of data collection and annotation. We employed the state-of-the-art GP-GAN model to create realistic synthetic imagery that integrates objects of interest into many real, unlabeled background images. We separated our overhead imagery dataset into its different geographical regions and performed experiments to evaluate the value of adding synthetic imagery in improving the accuracy and adaptability of object detection across regions. Our results suggest that adding GP-GAN-generated synthetic imagery to our baseline training dataset improves average precision over the baseline and slightly outperforms synthetic imagery created with 3D models of energy infrastructure.
Access to electricity is increasingly critical for promoting economic development, advancing social equity, and improving quality of life. Further, electricity access has been shown to correlate with improvements in income, education, and gender equality, and with reductions in maternal mortality. Yet 16% of the global population, or approximately 1.2 billion people, still don't have access to electricity in their homes. This 2017 map from the World Bank highlights the uneven distribution of energy access, with the majority of those without electricity access concentrated in sub-Saharan Africa and Asia.
One of the first steps in improving energy access is acquiring comprehensive data on the existing energy infrastructure in a given region.
This includes information on the type, quality, and location of that infrastructure so that energy developers and policymakers can then strategically and optimally deploy energy resources. This information is key for helping them make decisions about where to prioritize development, and whether they should use grid extension, micro/mini-grid development, or off-grid options to bring electricity access to new communities.
However, this critical information for expanding energy accessibility is often unattainable or low quality.
One potential solution to this issue is to automate the process of mapping energy infrastructure in satellite imagery. Using deep
learning, we can input satellite imagery into an object detection model and make predictions about the characteristics and contents
of the energy infrastructure in the region featured in the image, providing energy experts with the necessary data to expand electricity access.
Object detection consists of classification (identifying the correct object) and localization (identifying the location of a given object). Our project places a particular emphasis on object detection, as we seek to improve the detection of energy infrastructure in different terrains as part of expanding energy access data. An object detection model analyzes the scene in an image and generates bounding boxes around each object it finds, classifying each object and assigning a confidence score to its prediction. In the image on the left, the model predicted that the green, yellow, orange, and pink boxes correspond to a truck, a car, an umbrella, and a person, respectively. The model learns to predict these boxes and classes from examples provided to it. We refer to these labeled images as ground truth, as they contain boxes that denote every object's class and location within the image.
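As a concrete illustration, ground-truth boxes for detectors in the YOLO family are commonly stored in a plain-text format with one line per object: a class id followed by the box center, width, and height, all normalized by the image dimensions. The snippet below is a minimal sketch with made-up values, not a label from our dataset.

```python
# Minimal sketch of a YOLO-style ground-truth label (illustrative values only).
# Format per line: <class_id> <x_center> <y_center> <width> <height>,
# with all coordinates normalized to [0, 1] by the image width and height.
example_label = "0 0.48 0.52 0.10 0.12"  # class 0 assumed to be "wind turbine"

class_id, x_c, y_c, w, h = example_label.split()
print(f"class {class_id}: center=({x_c}, {y_c}), size=({w}, {h})")

# Converting the normalized center back to pixel coordinates for a 608x608 chip:
img_size = 608
print(f"center in pixels: ({float(x_c) * img_size:.0f}, {float(y_c) * img_size:.0f})")
```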
After training our object detection model, we can apply it to a collection of overhead imagery to locate and classify energy infrastructure across entire regions. In our experiments, we test our ability to detect wind turbines to maintain consistency with previous experiments. While we could demonstrate detection for any number of types of electricity infrastructure, wind turbines were chosen due to their relatively homogeneous appearance compared to power plants and other energy infrastructure. Additionally, our dataset was limited to the US, where there is a wealth of high-resolution overhead imagery available. Ultimately, the methods used to improve object detection of energy infrastructure will be expanded to more types of infrastructure and tested on more regions; however, limiting the infrastructure to wind turbines and using readily available US imagery helps to quickly provide performance benchmarks for our real and synthetic datasets.
While the potential of object detection seems promising, it presents two main challenges. The first is that properly training an object detection model requires thousands of already labeled images. According to Alexey Bochkovskiy, developer of the widely used YOLOv4 object detection model, it is ideal to have at least 2000 different images for each class to account for the different sizes, shapes, sides, angles, lighting, backgrounds, and other factors that vary from image to image. Thus, for the object detection model to generalize well, it will require thousands of training images per type of energy infrastructure. Because many types of energy infrastructure are rare objects, manually obtaining and annotating such a large quantity of satellite images featuring these infrastructures is expensive in terms of both time and cost.
The second challenge is that, in training an object detection model to detect energy infrastructure in certain regions, our training set and testing set must come from different locations and thus may differ in geographical background and other environmental factors. Without being properly trained for the test setting, object detection models still struggle to generalize across dissimilar images. If we train our model on a collection of images from one region, featuring similar background geographies, the model will perform fairly well on other images with those same physical background characteristics. However, if we then input images from a different region with different geographic characteristics, the model's performance becomes significantly worse.
Our proposed solution to these two problems is to introduce synthetic images into our training dataset. The synthetic images supplement the original real satellite imagery dataset to create a larger dataset for training our object detection model, diversifying the geographical backgrounds and the orientations of energy infrastructure that the model sees. We generate these synthetic images by cropping energy infrastructure out of satellite images and using a Generative Adversarial Network to blend each crop into a real background image, free of energy infrastructure, from one of the target geographic domains.
For five years, the Duke Energy Data Analytics Lab has worked on developing deep learning models that identify energy infrastructure,
with an end goal of generating maps of power systems and their characteristics that can aid policymakers in implementing effective
electrification strategies. In 2015-16, researchers created a model that could detect solar photovoltaic arrays with high accuracy [2015-16
Bass Connections Team].
In 2018-19, this model was improved to identify different types of transmission and distribution energy infrastructures,
including power lines and transmission towers [2018-19
Bass Connections Team]. Last year's project focused on increasing the adaptability of detection models
across different geographies by creating realistic synthetic imagery [2019-20
Bass Connections Team].
In 2020-2021, the Bass Connections project team extended this work, trying to improve the model’s ability to accurately detect rare objects in diverse
locations. After collecting satellite imagery from the National Agriculture Imagery Program database and clustering it by region, they experimented with generating synthetic imagery by taking satellite images featuring no energy infrastructure, placing 3D models of the object of interest on top of each image, and rendering a view that mimicked the appearance of a satellite image [2020-21
Bass Connections Team]. Their paper, Wind Turbine Detection With Synthetic Overhead Imagery, was presented at IGARSS 2021.
In our project, we build upon this progress and aim to improve on the 2020-21 Bass Connections team's approach to enhancing energy infrastructure detection in new, diverse locations.
Below is a description of the experiments we conducted to evaluate whether adding synthetic images to an object detection model's training data enhances its performance across geographic domains. After gathering real images and generating synthetic images, we construct two datasets: the first includes only real imagery, while the second includes both real and synthetic images. We train an object detection model on the first dataset, test it, and then repeat the process with the second dataset, comparing the results. If the model performs better when trained on the dataset with synthetic imagery, we can conclude that the synthetic imagery aids the model's performance. Given the previous Bass Connections team's work creating a synthetic dataset using CityEngine, we can also compare our synthetic dataset's performance against theirs to determine which synthetic dataset best improves energy infrastructure detection.
Generative Adversarial Networks (GANs) are a method of generative modeling. The concept behind GANs is a zero-sum game between two neural networks: a generator and a discriminator. While the generator attempts to create images that are as realistic as possible, the discriminator tries to determine whether those images are real or fake. The generator then learns from what the discriminator identifies as fake, helping it create more realistic images. This approach to generative modeling has seen a rapid increase in usage across scientific domains due to its ability to generate photorealistic images for tasks including data augmentation, art generation, image-to-image translation, image harmonization, and image super-resolution.
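To make the generator-discriminator game concrete, here is a minimal PyTorch sketch of a GAN training loop on toy data. The tiny fully connected networks and the random "real" batches are placeholders for illustration; this is not the GP-GAN architecture we use for blending.

```python
# A toy GAN training loop: the discriminator learns to separate real from
# generated samples, while the generator learns to fool the discriminator.
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 64, 32
generator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    real = torch.randn(batch, data_dim)      # stand-in for a batch of real images
    fake = generator(torch.randn(batch, latent_dim))

    # Discriminator step: label real samples 1 and generated samples 0.
    d_loss = bce(discriminator(real), torch.ones(batch, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator label its outputs as real.
    g_loss = bce(discriminator(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```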
In trying to generate synthetic imagery with real wind turbines in new terrains, our problem called for image harmonization and image blending: matching the visual appearance and style of the wind turbine and geographical background images when blending them into a single image. Given the state-of-the-art blending results demonstrated in “GP-GAN: Towards Realistic High-Resolution Image Blending” by Huikai Wu et al., we chose to investigate the potential for GANs to realistically generate our synthetic imagery dataset.
The synthetic images produced by GP-GAN are a quick and cost-effective way to expand labeled datasets that fall short of the large training set sizes required for adequate performance. To produce our images, we simply need a few images of each type of energy infrastructure (already in the training set) and a collection of background images. The background images do not have to be labeled, and thus are much more abundant and readily available. We can then make simple crops and bounding boxes for sampled real wind turbines within minutes. Our automatic image augmenter then randomly samples sizes, locations, and rotations to generate hundreds of source images in seconds. Our generated sources (size m) can be matched with any and all destinations (size n) to create as many as m × n output images from GP-GAN. Each image is blended in approximately 7 seconds, making our data pipeline an extremely quick and resource-efficient solution to data scarcity. For reference, if we generate 10 source images per destination image and download 100 destination images, we can create a dataset of 1,000 images in less than 2 hours, with diversity in wind turbine location, rotation, size, and geographical background. Below are some example synthetic images created from a variety of background images.
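As a rough illustration of the arithmetic above, the sketch below pairs a set of source crops with a set of destination backgrounds and estimates total blending time. The `gp_gan_blend` function is only a placeholder standing in for the actual GP-GAN blending call, and the file names are hypothetical.

```python
# Pair every generated source crop with every destination background:
# m sources x n destinations = m*n candidate synthetic images.
from itertools import product

def gp_gan_blend(source, destination):
    """Placeholder for the GP-GAN blending step (~7 seconds per image)."""
    return f"blend({source}, {destination})"

sources = [f"turbine_crop_{i}.png" for i in range(10)]       # m = 10
destinations = [f"background_{j}.png" for j in range(100)]   # n = 100

outputs = [gp_gan_blend(s, d) for s, d in product(sources, destinations)]
print(len(outputs), "synthetic images")                      # 10 * 100 = 1000

seconds_per_blend = 7
print(f"~{len(outputs) * seconds_per_blend / 3600:.1f} hours of blending")  # ~1.9 hours
```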
In designing the synthetic imagery, we must carefully control environmental variables to generate a diverse dataset of images that closely resemble real images. These design considerations are critical components of our methodology: the closer the synthetic imagery is to the real test imagery, the more the synthetic imagery will improve performance when added to our training set.
The first step of our image generation pipeline is the image augmenter, which requires us to consider the location, size, and rotation of our synthetic turbines. The previous Bass Connections team modeled the size distribution of the CityEngine turbines after the size distribution of the real turbines. To ensure a controlled experiment, we sampled location and size information to match the real distributions and allow for fair comparison with the CityEngine dataset. While no rotation data was previously stored, we randomized the rotation of every wind turbine so that the object detection model becomes familiar with wind turbines from various angles and views. A simplified sketch of this augmentation step is shown below.
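The sketch below illustrates the idea behind the augmenter: sample a turbine size and location (shown here with simple uniform sampling in place of the empirical distributions matched to the real turbines), apply a random rotation, and paste the crop onto a background; the bounding box label follows directly from the paste position and size. The image sizes and ranges are illustrative only.

```python
# Simplified augmenter: randomly place a resized, rotated turbine crop on a
# background and record the resulting bounding box.
import random
from PIL import Image

def make_source(turbine_crop, background, size_range=(40, 90)):
    bg = background.copy()
    size = random.randint(*size_range)                 # sampled turbine size in pixels
    angle = random.uniform(0, 360)                     # random rotation
    crop = turbine_crop.convert("RGBA").resize((size, size)).rotate(angle, expand=True)
    x = random.randint(0, bg.width - crop.width)       # sampled location
    y = random.randint(0, bg.height - crop.height)
    bg.paste(crop, (x, y), crop)                       # alpha channel masks the rotated corners
    label = (x, y, x + crop.width, y + crop.height)    # bounding box in pixel coordinates
    return bg, label

# Toy example using blank placeholder images instead of real imagery.
turbine = Image.new("RGB", (100, 100), "white")
background = Image.new("RGB", (608, 608), "green")
source, box = make_source(turbine, background)
print("bounding box:", box)
```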
Additionally, we had to choose which background images to place under our synthetic wind turbines. Consistent with the Bass Connections CityEngine experiments, we chose background imagery close to the real images in our testing set to maximize the similarity of our synthetic imagery to the target data. This methodology is consistent with real scenarios, as we will likely have access to unlabeled imagery, or the ability to collect it, from around the region we wish to test on for use as background images. Given that no manual labeling or filtering is required, and that we can generate many sources to blend with each background image, this background data collection should not be too time-consuming. Using background images close to our testing locations allows us to estimate the potential performance increase that the synthetic data can provide without introducing confounding variables such as a mismatch between the synthetic background image domain and the target domain (which would make it difficult to attribute poor performance to either the geographical background or the synthetic data generation).
To evaluate the potential of synthetic imagery to improve object detection performance, we set up within-domain and cross-domain experiments, where a domain is defined as a specific geographic region. The source domain refers to the region that the real training data comes from, while the target domain refers to the region that the object detection model is applied to. These two types of experiments each correspond to a potential real-world situation one might encounter, and they help us evaluate the potential performance of the object detection model in each of these situations.
In the context of energy access planning, the ultimate goal of this project is to utilize object detection in various regions of the world where energy access is extremely limited and information on existing energy infrastructure is not readily available. Thus, the object detection model must be able to generalize well across different images despite labeled real satellite imagery most likely being limited.
Our within-domain experiments, where the source and target domains are within the same geographic region, will help us to evaluate the potential for synthetic imagery to supplement limited real training data.
However, as mentioned previously, one of the key challenges of object detection is its poor performance when applied to data that looks significantly different from the data on which it was trained. The cross-domain experiments reflect the situation where there is no labeled data at all from the target domain, so the object detection model must be trained on data from an entirely different region. For this experiment, the synthetic data comes from the target region, but the real images come from a source region different from the target.
In constructing our experimental datasets, we need to determine what ratio of real to synthetic data yields the largest gain in performance (if any). Adding too much synthetic data could lead the model to overfit to the synthetic data, exacerbating any irregularities in the synthetic images or differences from real images such that the object detection model may perform worse. However, adding too little synthetic data will have a negligible effect on performance. The 2020-2021 Bass Connections team designed an experiment in which they tested real-to-synthetic ratios of 1:0, 1:0.5, 1:0.75, 1:1, and 1:2. They found that 1:0.75 yields the greatest performance as measured by average precision. Therefore, to maintain similarity to the Bass Connections team, we design our experiments using the 1:0.75 ratio. This ratio allows for a fair comparison of our synthetic data generation with the Bass Connections team's; in the future, we would like to experiment with different ratios for our synthetic data generation process to find the optimal ratio.
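As a quick illustration of what the 1:0.75 ratio means when assembling a training set (the count of real images here is made up):

```python
# Building the modified training set at a 1:0.75 real-to-synthetic ratio.
n_real = 800                          # hypothetical number of real training images
n_synthetic = int(0.75 * n_real)      # synthetic images added at the 1:0.75 ratio
print(n_real, "real +", n_synthetic, "synthetic =", n_real + n_synthetic, "images total")
```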
Having sampled our data and settled on the real-to-synthetic ratio, our final datasets for each region are as follows:
For each of the domains we selected, we ran the baseline and modified experiments with all of the data coming from the same region. These experiments help evaluate the overall ability of synthetic imagery (especially using our GP-GAN technique) to improve object detection performance.
For these experiments, the domains for the real source and target images are different, while the synthetic images used in the modified training dataset are from the target region. Thus, with synthetic images more similar to the target region, we hypothesize that the addition of the synthetic images will improve the accuracy of the object detection when the target and source regions are dissimilar in appearance. These experiments will help us to evaluate the potential for synthetic imagery to improve the object detection model’s ability to generalize across different regions despite the limitations of the existing training data.
YOLOv3 is a popular object detection model used in various computer vision tasks. YOLO stands for You Only Look Once, as the model is applied to an image only once, dividing the image into regions and predicting bounding boxes for each region. It is widely used because of its much faster detection speed at a similar mAP to other well-performing models. This speed is important for our task because, ultimately, we hope to automate the mapping of energy infrastructure from satellite imagery, which will require the model to quickly identify infrastructure in large datasets covering entire regions. The previous Bass Connections team also used YOLOv3, so we use YOLOv3 as well to make direct comparisons between the performance of our models and theirs without introducing confounding variables.
To understand our results, it is critical that we first understand the metrics we have chosen to measure performance. The primary metric we use is Average Precision, which combines the classification metrics of precision and recall. We will explain these metrics starting with the images on the left.
Now we plot the precision and recall of the model's predicted outputs on a graph, known as a precision-recall curve. On the curves to the right, it is evident that as precision increases, recall decreases, and vice versa; there is hence a tradeoff between precision and recall. However, we would like high values for both precision and recall, which means we would like the area under the precision-recall curve to be as large as possible. Average Precision (AP) quantifies this area, summarizing the precision-recall curve and rewarding models that achieve both high precision and high recall.
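For a concrete picture of these metrics, the short sketch below computes precision and recall from example detection counts and approximates AP as the area under an example precision-recall curve. All of the numbers are illustrative.

```python
# Precision: fraction of predicted boxes that are correct.
# Recall: fraction of ground-truth objects that are detected.
true_pos, false_pos, false_neg = 80, 20, 40
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")

# Average Precision: area under an example precision-recall curve,
# approximated here with the trapezoidal rule.
recall_pts    = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
precision_pts = [1.0, 0.95, 0.9, 0.8, 0.6, 0.4]
ap = sum((r1 - r0) * (p0 + p1) / 2
         for (r0, p0), (r1, p1) in zip(zip(recall_pts, precision_pts),
                                       zip(recall_pts[1:], precision_pts[1:])))
print(f"AP ≈ {ap:.2f}")
```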
In the machine learning space, even small absolute increases in AP can represent a significant improvement in model performance.
Due to variability and stochasticity in the object detection model's training process, there will be slight variations between the results of each run, as shown in the image on the left. Each experiment is therefore repeated four times to account for this randomness and improve the reliability of the results. The average AP value is calculated and used to compare the results of our baseline model, the model with added CityEngine images, and the model with added GP-GAN images.
The performance of the model with added synthetic images improves significantly in both within-domain and cross-domain settings. Synthetic images are especially helpful in cross-domain settings, which means they can be useful when there is a lack of data from the target domain or when collecting such data is cost-prohibitive.
Here we present a closer look at the results of training with real images from each of the three geographic regions. There is a disparity in performance when the model is trained with real images from different geographic domains. In particular, in cross-domain experiments that test on the Eastern Midwest region, the model generally performs worse than when testing on other regions.
As shown above, the model performs consistently worse in the cross-domain experiments. However, the model sees its greatest average improvement in average precision from the addition of the GP-GAN imagery in these same cross-domain experiments, improving overall cross-domain performance by 31% over the baseline. In fact, the effect of the GP-GAN imagery is most noticeable when considering the worst performance of each dataset. The GP-GAN dataset's worst performance, an Average Precision of 0.638 on the Train EM Val NE experiment, is much higher than the other models' worst performances, providing a sharp increase in performance. Thus, it shows promise for bridging the gap in cross-domain experiments across different geographic regions.
The results show that adding the curated GP-GAN-generated imagery improves the performance of our object detection model in all cases, especially in cross-domain experiments (testing on an unseen region). The performance increase is more limited in the within-domain setting, where the model is tested on a previously seen region and was already performing well. Furthermore, our model not only improves upon the baseline but also upon the synthetic CityEngine dataset, demonstrating its ability to outperform other methods of synthetic image generation, especially in cross-domain experiments. Given that our method of synthetic image generation is free and quick, it presents a simple and effective way to enhance object detection performance on new domains. It can also supplement datasets for which we simply lack training data, which is often the case when we are trying to obtain information on energy infrastructure. With the aid of our synthetic imagery, this method of identifying and locating energy infrastructure in a geographic region could bridge the information gaps that energy access planners face when making decisions about electrification.
We would like to thank Dr. Kyle Bradbury, Dr. Jordan Malof, and Wayne Hu for their help and guidance along the way. We would also like to thank the previous Bass Connections and Data+ teams for their work leading up to this project. Additionally, we would like to thank Dr. Paul Bendich and Dr. Greg Herschlag for organizing and hosting the Duke Data+ talks, and the speakers who shared their wisdom about the field of data science. Thank you to the Duke Data+ program and the Duke Energy Initiative for supporting this project.