How machine learning and open source data can dramatically improve the quality of low-cost sensor output.

BY Drew Hill, MPH, PhD
November 06, 2018

Low-cost sensors: a true “disruption”

Understanding the quality of the air we breathe once required specialized field skills, large semi-permanent sample staging, and a hefty budget (often $10,000+ per monitor!). Today, low-cost sensors are changing the way we think about air quality assessment. While academics have been cleverly adapting inexpensive technologies like smoke detectors to estimate air quality on a budget since at least the early 2000s, a new family of affordable ready-made sensors, like the Purple Air, Foobot, and Clarity Node, is rapidly emerging as a boon to consumers, governments, and environmental consultants. Today, it seems that all you need is a $200 handheld gadget and an internet connection to get real-time air quality estimates right on your smartphone. But, as someone’s granddad somewhere always said, “there’s no such thing as a free lunch.”

No free lunch

Low-cost sensors often measure proxies of pollutant concentrations, like the amount of light (measured as a voltage) that is scattered by particulate pollutants when a beam is sent through a parcel of air, rather than the actual parameter of interest. They also keep their small form factor and low prices by using lower-grade hardware that allows more noise to infiltrate the measurement signal. Perhaps most importantly, the signal of many low-cost sensors can be heavily affected by constantly changing factors like temperature, humidity, and the complex characteristics of the pollution being measured. That light beam, for example, will scatter less off darker pollution than off the same concentration of lighter pollution because darker particles absorb more of the light (think of wearing a dark shirt on a bright sunny day). For the most part, manufacturers and academics are making great strides in accounting for signal noise and temperature effects, but accounting for real-time changes in pollution properties, especially for particulate matter (PM), has remained a substantial hurdle. A new approach that combines data analytics with carefully selected, publicly available information on relevant environmental changes may help us clear this hurdle.

We can (and have begun to) overcome this hurdle!

Subject matter experts (environmental consultants, academics, government officials, etc.) can tell you what sort of high-level events are generally going to affect local air pollution characteristics. For example, upticks in traffic, nearby wildfires, major power outages, and weather will all impact the contributors to and the makeup of local pollution. They will also tell you that trends in air quality measured several miles upwind are likely to provide insight into nearby air quality. However, the relationships between these factors and the pollution characteristics that affect low-cost sensor output are highly complex.
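To make the idea of "high-level events" as model inputs a little more concrete, here is a minimal sketch of lining up sensor readings with the most recent contextual record (weather, in this toy case). Everything in it is an assumption made for illustration: the function name `attach_latest`, the field names, the one-hour staleness limit, and the values themselves are invented and are not taken from the analysis.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def attach_latest(sensor_rows, context_rows, max_age=timedelta(hours=1)):
    """Attach to each sensor reading the most recent context record
    observed at or before it, provided that record is not too stale."""
    context_rows = sorted(context_rows, key=lambda r: r["time"])
    times = [r["time"] for r in context_rows]
    merged = []
    for row in sensor_rows:
        # Index of the last context record with time <= reading time.
        i = bisect_right(times, row["time"]) - 1
        out = dict(row)
        if i >= 0 and row["time"] - times[i] <= max_age:
            for key, value in context_rows[i].items():
                if key != "time":
                    out[key] = value
        merged.append(out)
    return merged


# Invented illustrative values, not real measurements.
sensor = [
    {"time": datetime(2018, 7, 1, 12, 5), "pm25_raw": 31.0},
    {"time": datetime(2018, 7, 1, 13, 10), "pm25_raw": 44.0},
    {"time": datetime(2018, 7, 1, 16, 0), "pm25_raw": 52.0},
]
weather = [
    {"time": datetime(2018, 7, 1, 12, 0), "rh_pct": 35.0, "wind_dir_deg": 270.0},
    {"time": datetime(2018, 7, 1, 13, 0), "rh_pct": 30.0, "wind_dir_deg": 280.0},
]

features = attach_latest(sensor, weather)
for row in features:
    print(row)
```

The last reading gets no weather fields because the nearest record is three hours old. Leaving a gap rather than carrying stale context forward is a design choice, and a real pipeline might reasonably choose otherwise.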

In an analysis recently presented at the Air Sensors International Conference in Oakland, CA (see the full presentation!), we demonstrated that ensemble machine learning methods (random forests and support vector machines alongside GLM and glmnet) can be used to account for sensor-relevant changes in local pollution composition and substantially improve the quality, and actionability, of low-cost sensor data. We also showed that information on such high-level events can be captured with reasonable spatial and temporal resolution from freely available (mostly government) data repositories using open source products like R and rOpenSci.
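For readers who want a feel for what an ensemble of these learner types can look like, here is a rough sketch. The analysis itself was done in R, so this is a stand-in and not the presented method: it uses Python's scikit-learn, with stacking as one common way of combining learners and an elastic net in the role that glmnet plays in R. The data are synthetic, and the humidity-dependent over-read is an invented assumption used only to give the models something to correct.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (assumption): the raw sensor over-reads, more so when humid.
rng = np.random.default_rng(0)
n = 300
true_pm = rng.uniform(2, 60, n)      # stand-in for reference monitor values
humidity = rng.uniform(20, 90, n)    # stand-in for an open-data weather feature
raw_pm = true_pm * (1.5 + 0.01 * humidity) + rng.normal(0, 2, n)
X = np.column_stack([raw_pm, humidity])

# Base learners mirroring the families named in the text.
base_learners = [
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVR(C=10.0))),
    ("glm", LinearRegression()),
    ("enet", make_pipeline(StandardScaler(), ElasticNet(alpha=0.1))),
]

# Stacking: a final linear model learns how to weight the base learners'
# out-of-fold predictions.
ensemble = StackingRegressor(
    estimators=base_learners,
    final_estimator=LinearRegression(),
    cv=5,
)

# Fit on the first 200 rows, evaluate on the remaining 100.
ensemble.fit(X[:200], true_pm[:200])
pred = ensemble.predict(X[200:])

raw_mae = np.mean(np.abs(raw_pm[200:] - true_pm[200:]))
ens_mae = np.mean(np.abs(pred - true_pm[200:]))
print(f"MAE of raw sensor: {raw_mae:.1f}  MAE after ensemble: {ens_mae:.1f}")
```

On toy data like this any sensible model will do well, so the block is only meant to show the mechanics. With real sensors, the additional open-data features are what the ensemble would be expected to take advantage of.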


Uncalibrated PM2.5 data from the Plantower sensors inside five low-cost air quality monitors are plotted against co-located regulatory-grade PM2.5 monitors in the image below. (The monitors are Clarity Nodes; their manufacturer does offer a calibration service, which we did not use here.) A perfectly representative low-cost sensor would produce dots that fall directly along the black line, which marks where government monitor concentrations equal the low-cost sensor estimates. What we see, however, is a consistent over-estimation of pollution concentrations (the blue line, or central tendency of the low-cost sensor data, does not follow the black line, or the central tendency of the government monitor) and some noise (the amount of scatter around the blue line).
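The comparison described here can also be expressed as a few numbers. The sketch below is an illustration with invented values, not the statistics from the presentation. It computes mean bias, RMSE, and the slope and intercept of a least-squares line through sensor-versus-reference points. A slope of 1 with an intercept of 0 would correspond to the black one-to-one line; the helper name `agreement_stats` is hypothetical.

```python
from math import sqrt
from statistics import mean


def agreement_stats(reference, sensor):
    """Summarize how closely sensor readings track a reference monitor."""
    errors = [s - r for r, s in zip(reference, sensor)]
    mean_bias = mean(errors)                  # > 0 means over-estimation
    rmse = sqrt(mean(e * e for e in errors))  # bias and noise combined

    # Ordinary least-squares fit of sensor on reference.
    ref_mean, sen_mean = mean(reference), mean(sensor)
    sxx = sum((r - ref_mean) ** 2 for r in reference)
    sxy = sum((r - ref_mean) * (s - sen_mean) for r, s in zip(reference, sensor))
    slope = sxy / sxx
    intercept = sen_mean - slope * ref_mean
    return {"mean_bias": mean_bias, "rmse": rmse,
            "slope": slope, "intercept": intercept}


# Invented values: a sensor that reads roughly double the reference.
reference = [5.0, 10.0, 20.0, 40.0]
sensor = [11.0, 21.0, 41.0, 81.0]

stats = agreement_stats(reference, sensor)
print(stats)
```

In this toy case the fitted slope is well above 1, which is the numerical counterpart of a blue line that climbs more steeply than the black one.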

Processing these data with a combination of ensemble machine learning methods and features built from open source data yielded pollution estimates with dramatically less bias for every device (that is, the sensor output better matched the government monitor output!). Other than a few outliers, our model allowed us to reproduce “true” concentrations (the government monitor’s measurements) with noticeably less bias than the original sensor estimates and potentially less error, though more analysis is needed to confirm the latter. 10-fold cross-validated predictions are plotted against “true” concentrations in the colorful image below. (Note: do let me know if you want more stats. I can pull together more refined RMSE and similar figures if there is interest.)
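For anyone unfamiliar with the term, "10-fold cross-validated predictions" means that every plotted prediction comes from a model that never saw that data point during fitting. The sketch below shows only that mechanic. It deliberately swaps the ensemble for a one-variable straight-line calibration to stay short, and the data, function names, and noise level are all invented for illustration.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for y as a function of x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def cross_val_predict_line(x, y, k=10, seed=0):
    """Return one out-of-fold prediction per observation."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal folds
    preds = [None] * len(x)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train],
                                    [y[i] for i in train])
        for i in fold:                      # predict only the held-out points
            preds[i] = slope * x[i] + intercept
    return preds


# Invented data: the raw sensor reads about double the reference, plus noise.
rng = random.Random(1)
reference = [rng.uniform(2, 60) for _ in range(50)]
raw = [2 * r + 1 + rng.gauss(0, 2) for r in reference]

preds = cross_val_predict_line(raw, reference, k=10)

raw_mae = sum(abs(s - r) for s, r in zip(raw, reference)) / len(reference)
cv_mae = sum(abs(p - r) for p, r in zip(preds, reference)) / len(reference)
print(f"MAE raw: {raw_mae:.1f}  MAE cross-validated: {cv_mae:.1f}")
```

Because each prediction is made out of fold, an improvement seen this way is less likely to be an artifact of the model memorizing its own training data.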

What’s the big deal?

We used predictions from a machine learning model to reliably estimate “true” air pollution concentrations at many points in space and time using low-cost sensor output, local weather data, satellite wildfire estimates (see the presentation!), and a few other open-data-based variables with a free ensemble machine learning statistical package … oh, and it was done on a laptop.

I believe that this targeted multi-source data analytics framework can be used to turn the glut of data collected by consumers around the world over the last several years and openly published by manufacturers (e.g., Purple Air) into research-grade information on environmental air pollution and human exposures.

With the right finesse and collaboration among government agencies, some improvements to their data transfer processes (note: the US Dept. of Commerce has already made some good progress here), and a bit of elbow grease, the multi-source data analytics framework might also be expanded to convert low-cost sensor output into more reliable air quality data in real time without substantially expanding cost or resource outlay beyond the price of the low-cost sensor itself.

The possibilities are very exciting. I feel privileged to have the chance to work on these sorts of issues in my current job, and look forward to sharing more updates and insights on the state of the field!


Co-authors on the presentation were Ajay Pillarisetti and Kirk Smith of the University of California, Berkeley and Shari Libicki of Ramboll (formerly Environ). Several months of data from Clarity Node devices were generously provided by Clarity (while they do provide a sophisticated calibration service, we only used uncalibrated data). Clarity also provided co-located regulatory grade monitoring data from the San Joaquin Valley Air Pollution Control District and the Bay Area Air Quality Management District.

Edits: 1) Article written as an employee of SHAIR and Ramboll. 2) An earlier version suggested co-located regulatory data were provided to us by the San Joaquin Valley Air Pollution Control District and the Bay Area Air Quality Management District; however, Clarity procured and generously provided us with the data.

This article previously appeared as a LinkedIn article by one of our team members on approximately the date noted on this blog post.
