About a year ago, my coauthors and I published a huge dataset of more than a million annotated images of animals from a camera trap network in the Serengeti. The lead author, Dr. Swanson, and I are both early career scientists, and we both put a ton of time and effort into this dataset. We made the decision to publish the dataset as its own product after more than a half-dozen researchers in other fields (computer vision, citizen science, education) contacted us to ask if they could use our data. Our graduate advisor (and PI-on-paper) wondered whether this was a good idea. If we published the data, he worried, other people could take it and do the sorts of community ecology research that we were hoping to do with it.
I’ve heard this worry a lot about open data. I’ve had this worry myself as a grad student. But as far as I can tell, having made this dataset (and others) available, the worry about being scooped is way overblown for most ecology datasets. That doesn’t mean it can’t or doesn’t happen. But I think it’s a rare case when it does. (Can anyone point to a time it’s happened?) Instead, opening up the data has meant two great things. First, when people contact us about our data and camera trap network (which happens monthly), we can just point them to the dataset and it saves us a ton of time. Second, there are ecologists using our data in ways we never imagined, including looking at community ecology in groups of animals we don’t study (small mammals and lizards, rather than large mammals) and investigating wildlife disease.
Open data is great!
But. (You knew there was going to be a but.) Here’s something I haven’t heard proponents of Open Science talking about much. If you publish a dataset, you pretty much lose control over authorship.
Traditionally, the way data in ecology worked (and still mostly works) is that you go through a lot of effort to create a dataset. Then you keep it. Hopefully you’re smart and you back it up and have other safeguards to make sure it doesn’t get compromised. But usually it just sits on your desktop computer somewhere. Then people find out about your data. Probably you published something. Maybe sometimes through word of mouth. And if people want to use your data, they contact you and say, “hey, I have a great idea for an analysis and paper that needs your data. Can we collaborate?” Often this is code for, “if you give me your data, I’ll give you co-authorship on the resulting publications.”
And there’s a reason for this customary tit-for-tat. Producing ecological datasets is far from trivial. It’s also nice to know who is using your data and for what. As a data-creator, you want to make sure your data is not misused. Not only do you care about the science coming out right, but because your reputation is attached to the data, a misuse reflects poorly on you, even if it’s done by someone else.
The LTER network has an explicit data policy that reads, “The Data Set has been released in the spirit of open scientific collaboration. Data Users are thus strongly encouraged to consider consultation, collaboration and/or co-authorship with the Data Set Creator.” Not too long ago this policy was on a site-by-site basis and — at least for the sites that I used data from — contacting the data creator was a requirement for publishing using existing data.
For early career researchers, there’s a super important reason for this custom of co-authorship when re-using data. Number of publications matters. It just does. If I have spent some sizeable fraction of my nascent career on developing a particular dataset, I need to get credit when that dataset is used for advancing science. And the truth is that number of publications counts way more than number of citations.
So here’s the problem: anyone can use data from our big published dataset (please do!), and they will be right and proper to simply cite it. If we hadn’t published the dataset, then people would have to contact us about collaborating and my coauthors and I could rack up more publications. Perhaps the data would be used less overall, because it’s a bit more effort to exchange a few emails than to simply download a dataset. The crucial point is that Open Data may be good for science, but it may be bad for scientists, especially early career ones. Not because the authors of open data will be scooped, but because the authors lose credit for their data relative to authors who don’t make their data open.