At 10 AM the Saturday before inauguration day, on the sixth floor of the Van Pelt Library at the University of Pennsylvania, roughly 60 hackers, scientists, archivists, and librarians were hunched over laptops, drawing flow charts on whiteboards, and shouting opinions on computer scripts across the room. They had hundreds of government web pages and data sets to get through before the end of the day—all strategically chosen from the pages of the Environmental Protection Agency and the National Oceanic and Atmospheric Administration—any of which, they felt, might be deleted, altered, or removed from the public domain by the incoming Trump administration.
Their undertaking, at the time, was purely speculative, based on travails of Canadian government scientists under the Stephen Harper administration, which muzzled them from speaking about climate change. Researchers watched as Harper officials threw thousands of books of aquatic data into dumpsters as federal environmental research libraries closed.
But three days later, speculation became reality as news broke that the incoming Trump administration’s EPA transition team does indeed intend to remove some climate data from the agency’s website. That will include references to President Barack Obama’s June 2013 Climate Action Plan and the 2014 and 2015 strategies to cut methane emissions, according to an unnamed source who spoke with Inside EPA. “It’s entirely unsurprising,” said Bethany Wiggin, director of the environmental humanities program at Penn and one of the organizers of the data-rescuing event.
Back at the library, dozens of cups of coffee sat precariously close to electronics, and coders were passing around 32-gigabyte zip drives from the university bookshop like precious artifacts.
The group was split in two. One half was setting web crawlers upon NOAA web pages that could be easily copied and sent to the Internet Archive. The other half was working its way through the harder-to-crack data sets—the ones that fuel pages like the EPA’s incredibly detailed interactive map of greenhouse gas emissions, zoomable down to each high-emitting factory and power plant. “In that case, you have to find a back door,” said Michelle Murphy, a technoscience scholar at the University of Toronto.
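The easy half of that split is simpler than it sounds: the Wayback Machine exposes a public `/save/` endpoint that takes a URL and snapshots it. A minimal sketch of nominating pages for archiving that way—the page list here is illustrative, not the team's actual target list:

```python
# Sketch: asking the Internet Archive's Wayback Machine to snapshot pages.
# The /save/ endpoint is real; the target pages below are illustrative.
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def archive_url(page_url: str) -> str:
    """Build the Wayback 'save' request URL for a given page."""
    return SAVE_ENDPOINT + page_url

pages = [
    "https://www.noaa.gov/",  # illustrative target, not from the event's list
]

for page in pages:
    request_url = archive_url(page)
    # urllib.request.urlopen(request_url)  # uncomment to actually submit
    print(request_url)
```

A GET against each generated URL is all it takes to queue a snapshot, which is why this side of the room could move so much faster than the "baggers."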
Murphy had traveled to Philly from Toronto, where another data-rescuing hackathon had taken place a month prior. Murphy brought with her a list of all the data sets that were too tough for the Toronto volunteers to crack before their event ended. “Part of the work is finding where the data set is downloadable—and then sometimes that data set is hooked up to many other data sets,” she said, making a tree-like motion with her hands.
At Penn, a group of coders that called themselves “baggers” set upon these tougher sets immediately, writing scripts to scrape the data and collect them in data bundles to be uploaded to DataRefuge.org, an Amazon Web Services-hosted site which will serve as an alternate repository for government climate and environmental research during the Trump administration. (A digital “bag” is like a safe, which would alert the user if anything within it is changed.)
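The article doesn't show the bagging code itself; the rescues used the BagIt packaging format, but the core tamper-detection idea—the "safe" that alerts you to changes—can be sketched with a simple checksum manifest:

```python
# Minimal sketch of the "bag" idea: a checksum manifest that flags tampering.
# Real bags follow the BagIt format; this is a simplified stand-in.
import hashlib
import os

def sha256_of(path):
    """Hash a file's contents in chunks, so large data sets fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def make_manifest(paths):
    """Record a checksum for every file placed in the bag."""
    return {p: sha256_of(p) for p in paths}

def verify(manifest):
    """Return the files whose contents no longer match the manifest."""
    return [p for p, digest in manifest.items() if sha256_of(p) != digest]

# Demo with a throwaway file
with open("dataset.csv", "w") as f:
    f.write("site,ppb\nexample site,12\n")
manifest = make_manifest(["dataset.csv"])
print(verify(manifest))   # untouched bag: nothing flagged
with open("dataset.csv", "a") as f:
    f.write("tampered row\n")
print(verify(manifest))   # any change to the contents is detected
os.remove("dataset.csv")
```

Because future researchers would re-hash the files and compare against the manifest, even a one-byte edit to a rescued data set would be visible.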
“We’re yanking the data out of a page,” said Laurie Allen, the assistant director for digital scholarship in the Penn libraries and the technical lead on the data rescuing event. Some of the most important federal data sets can’t be extracted with web crawlers: Either they’re too big, or too complicated, or they’re hosted in aging software and their URLs no longer work, redirecting to error pages. “So we have to write custom code for that,” Allen says, which is where the improvised data-harvesting scripts that the “baggers” write will come in.
But data, no matter how expertly it is harvested, isn’t useful divorced from its meaning. “It no longer has the beautiful context of being a website, it’s just a data set,” Allen says.
That’s where the librarians came in. In order to be used by future researchers—or possibly used to repopulate the data libraries of a future, more science-friendly administration—the data would have to be untainted by suspicions of meddling. So the data must be meticulously kept under a “secure chain of provenance.” In one corner of the room, volunteers were busy matching data to descriptors like which agency the data came from, when it was retrieved, and who was handling it. Later, they hope, scientists can properly input a finer explanation of what the data actually describes.
But for now, the priority was getting it downloaded before the new administration got the keys to the servers next week. Plus, they all had IT jobs and dinner plans and exams to get back to. There wouldn’t be another time.
Bag It Up
By noon, the team feeding web pages into the Internet Archive had set crawlers upon 635 NOAA data sets—everything from ice core samples to “radar-derived coastal ocean current velocities.” The “baggers,” meanwhile, were busy finding ways to rip data from the Department of Energy’s Atmospheric Radiation Measurement Climate Research Facility website.
In one corner, two coders were puzzling over how to download the Department of Transportation’s Hazmat accidents database. “I don’t think there would be more than a hundred thousand hazmat accidents a year. Four years of data for fifty states—so 200 state-years, so…”
“Less than 100,000 in the last four years in every state. So that’s our upper limit.”
“It’s kind of a macabre activity to be doing here—sitting here downloading hazmat accidents.”
At the other end of the table, Nova Fallen, a Penn computer science grad student, was puzzling over an interactive EPA map of the US showing facilities that violated EPA’s rules.
“There’s a 100,000 limit on downloading these. But it’s just a web form, so I’m trying to see if there’s a Python way to fill out the form programmatically,” said Fallen. Roughly 4 million violations filled the system. “This might take a few more hours,” she said.
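Fallen's code isn't shown in the piece, but the general approach—submitting the form programmatically in pages small enough to clear the cap—can be sketched. The endpoint and form-field names below are hypothetical stand-ins:

```python
# Sketch of paging past a web form's download cap by submitting it
# programmatically. The endpoint URL and field names are hypothetical.
import urllib.parse
import urllib.request

ENDPOINT = "https://example.gov/violations/search"  # hypothetical URL
PAGE_SIZE = 100_000   # the form's per-download limit
TOTAL_ROWS = 4_000_000  # roughly 4 million violations in the system

def build_request(offset):
    """Encode one page's worth of form fields as a POST request."""
    fields = {"start": offset, "rows": PAGE_SIZE}  # assumed field names
    data = urllib.parse.urlencode(fields).encode()
    return urllib.request.Request(ENDPOINT, data=data)

# ~4 million records at 100,000 per request means ~40 submissions
requests_to_send = [build_request(off) for off in range(0, TOTAL_ROWS, PAGE_SIZE)]
print(len(requests_to_send))
# for req in requests_to_send:
#     urllib.request.urlopen(req)  # uncomment to actually submit
```

Forty sequential requests against a slow government server is exactly why Fallen expected the job to "take a few more hours."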
Brendan O’Brien, a coder who builds tools for open-source data, was deep into a more complicated task: downloading the EPA’s entire library of local air monitoring results from the last four years. “The page didn’t seem very public. It was so buried,” he said.
Each entry for each air sensor linked to another set of data—clicking each link would take weeks. So O’Brien wrote a script that could find each link and open them. Another script opened the link, and copied what it found into a file. But inside those links were more links, so the process began again.
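O'Brien's actual scripts aren't reproduced in the article, but the link-chasing process it describes is a breadth-first crawl: open a page, copy what you find, queue every link it contains, repeat. A minimal sketch with illustrative URLs and a pluggable fetch function:

```python
# Sketch of the link-chasing approach described above: follow every link,
# save what you find, and queue any further links. URLs are illustrative.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect every href attribute on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(base_url, html):
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl(start_url, fetch, max_pages=1000):
    """Breadth-first crawl: open each link, save the body, queue new links."""
    seen, queue, saved = set(), [start_url], {}
    while queue and len(saved) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        body = fetch(url)   # in practice: urllib.request.urlopen(url).read()
        saved[url] = body   # copy what we found into a file/store
        queue.extend(extract_links(url, body))
    return saved

# Demo against fake in-memory pages instead of a live site
fake_pages = {
    "https://example.gov/index": '<a href="report1.csv">r1</a>',
    "https://example.gov/report1.csv": "site,value",
}
archive = crawl("https://example.gov/index", lambda u: fake_pages.get(u, ""))
print(sorted(archive))
```

The `seen` set is what keeps the "links inside links" recursion from looping forever when pages cross-reference each other.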
Eventually, O’Brien was watching raw data—basically, a text file—roll in. It was indecipherable at first, just a long string of words or numbers separated by commas. But they began to tell a story. One line contained an address in Phoenix, Arizona: 33 W Tamarisk Ave. This was air quality data from an air sensor at that spot. Beside the address were number values, then several types of volatile organic compounds: propylene, methyl methacrylate, acetonitrile, chloromethane, chloroform, carbon tetrachloride. Still, there was no way to tell if any of those compounds were actually in the air in Phoenix; in another part of the file, numbers that presumably indicated levels of air pollution were sitting unpaired with whatever contaminant they corresponded to.
But O’Brien said they had reason to believe this data was particularly at risk—especially since the incoming EPA administrator Scott Pruitt has sued the EPA multiple times as Oklahoma’s Attorney General to roll back the agency’s more blockbuster air pollution regulations. So he’d figure out a way to store the data anyway, and then go back and use a tool he built called qri.io to pull apart the files and try to arrange them into a more readable database.
By the end of the day, the group had collectively loaded 3,692 NOAA web pages onto the Internet Archive, and found ways to download 17 particularly hard-to-crack data sets from the EPA, NOAA, and the Department of Energy. Organizers have already laid plans for several more data rescue events in the coming weeks, and a professor from NYU was talking hopefully about hosting one at his university in February. But suddenly, their timeline became more urgent.
On the day that the Inside EPA report came out, an email from O’Brien popped up on my phone with “Red Fucking Alert” in the subject line.
“We’re archiving everything we can,” he wrote.
Date of Publication: 01.19.17
Time of Publication: 9:00 am