Conclusions

In this study, we presented an image-based method for detecting polluting plumes in dense urban environments. The approach fits a region-based convolutional neural network to continuously sampled images of New York City's skyline. The model achieves a mean average precision of 61.66 % for plume detection on the testing set.
During the labeling of the training set, we found that temporal context was effective for distinguishing plumes from other plume-like patterns. We expect that model performance would improve if this contextual information were integrated through a 3D CNN such as the Region Convolutional 3D Network (R-C3D) (\citealt{saenko2017}), which extracts spatiotemporal features that capture activities and could accurately localize the start and end times of each plume.

Supplementary materials   

Methodology (Technical description)

Background subtraction and statistical heuristics

Plumes are difficult to identify in the original images because the cityscape contains many other distracting features. To control for this, background subtraction methods (Figure \ref{724398}) were employed to remove stationary objects from the frame. This acts as a noise attenuation step that both subtracts out features irrelevant to plume detection and removes most of the features specific to this cityscape, making the model more generalizable.
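
As an illustration of the idea only, the following minimal sketch removes stationary content by subtracting a per-pixel temporal median from a stack of grayscale frames. The median background model, the function name `subtract_background`, the `threshold` value, and the synthetic test frames are all assumptions made for this example; they are not the specific background subtraction methods used in this study.

```python
import numpy as np


def subtract_background(frames, threshold=25.0):
    """Remove stationary content from a stack of grayscale frames.

    frames: array of shape (T, H, W), one image per time step.
    threshold: hypothetical intensity cutoff for the foreground mask.
    Returns (residual, mask), both of shape (T, H, W).
    """
    frames = np.asarray(frames, dtype=np.float32)
    # Assumed background model: the per-pixel median over time.
    # Buildings and other static objects dominate the median,
    # whereas a transient plume occupies a pixel only briefly.
    background = np.median(frames, axis=0)
    # Absolute deviation of each frame from the static background.
    residual = np.abs(frames - background)
    # Pixels that deviate strongly are candidate moving features.
    mask = residual > threshold
    return residual, mask


# Synthetic example: a static random "cityscape" repeated over 9 frames,
# with a bright 4x4 patch (a stand-in for a plume) in frame 4 only.
rng = np.random.default_rng(0)
static_scene = rng.uniform(0, 255, size=(32, 32)).astype(np.float32)
frames = np.repeat(static_scene[None], 9, axis=0)
frames[4, 10:14, 10:14] += 80.0

residual, mask = subtract_background(frames)
```

In this toy case the static scene cancels out and only the transient patch remains in `mask`. A real image sequence would also contain illumination changes and camera noise, so the background model and threshold would need to be chosen with more care.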