3 Reasons Why Big Data Should Not Be Open Data

Recently, Bill Kedrock wrote that Big Data Needs to be Open Data, based on a presentation by Dr. Debisi Araba at MERLTech 2016. Bill concluded by calling for donors and others to assist countries like Nigeria as they grapple with the opportunities and challenges of open data.
Since the post, the USAID Food Security Strategy was published and says:
“Open and accessible data are essential assets that provide a foundation of evidence for scientists and decision-makers globally and help fuel entrepreneurship, innovation, and scientific discovery in food security and nutrition.”
In response to both, Michelle Kaffenberger collaborated with Bill to consider the risks associated with making big data open. They propose that decision makers weighing whether to open big data apply a litmus test that includes at least the following three questions, and likely others depending on the specific context of the data.
This post uses smallholder farmer data collected through the Nigerian Growth Enhancement Support Scheme as the backdrop for a set of initial principles to guide those weighing the pros and cons of opening big data.
1. What level of personal information is contained in the data?
Big data, which can include call data records, GPS coordinates, and mobile money transactions, contains an incredible amount of personal information. The smallholders database in Nigeria, for example, contains sensitive data points including the farmer’s national ID number (the equivalent of a social security number), e-money transaction data, bank account ownership, size of the farm, and GPS coordinates. Making this level of personal information publicly available is poor practice in any context. Would you want the world to see your bank account transactions?
In addition, even if the data is “anonymized” by removing data points that directly identify someone (such as name, identification number, and GPS coordinates), it is now notoriously easy to determine an individual’s identity based on correlations among the fields that remain. The ability to mine anonymized data for individuals’ information suggests that:

  • Extreme care must be taken to ensure enough information is removed from the data set during the anonymization process that an individual cannot be personally identified, and
  • Big data should be released only at an aggregated level, permitting analysis of aggregate consumer data at a district or county level, but not at the individual level.
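
As a rough illustration of the second point, the sketch below releases only district-level summaries and drops any district with too few people to hide an individual. The record fields (`district`, `uses_mobile_money`), the sample records, and the threshold of five are assumptions made for this example, not details of the Nigerian database or a recommended standard.

```python
from collections import defaultdict

# Hypothetical threshold: districts with fewer farmers than this are withheld,
# because a very small group can still point to specific individuals.
MIN_GROUP_SIZE = 5


def aggregate_by_district(records, min_group_size=MIN_GROUP_SIZE):
    """Return per-district summaries only; no individual-level fields are emitted."""
    groups = defaultdict(list)
    for record in records:
        groups[record["district"]].append(record)

    summary = {}
    for district, rows in groups.items():
        if len(rows) < min_group_size:
            continue  # too few people to release safely
        users = sum(1 for row in rows if row["uses_mobile_money"])
        summary[district] = {
            "farmers": len(rows),
            "mobile_money_share": round(users / len(rows), 2),
        }
    return summary


# Made-up records for illustration only.
records = (
    [{"district": "X", "uses_mobile_money": True} for _ in range(6)]
    + [{"district": "X", "uses_mobile_money": False} for _ in range(4)]
    + [{"district": "Y", "uses_mobile_money": True} for _ in range(2)]
)
result = aggregate_by_district(records)
print(result)
```

Here District Y is withheld entirely because only two farmers fall in it. Aggregation of this kind reduces, but does not eliminate, the risk of re-identification, so the threshold and the choice of published fields still need case-by-case judgment.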

2. Are the individuals comfortable with their data being shared publicly? Have they consented to sharing, and if so, do they understand what they consented to?
If data will be shared openly, consumers should have the option to opt in (or not) through an informed consent process. In this way, the default is that their information is kept private, unless they make an active, intentional decision for it to be shared. CGAP research shows that providing simple explanations of data and its potential uses helps consumers make informed decisions about when to share it.
Looking at our smallholder farmer data from Nigeria, there was no request to opt in to data sharing, and no consent was sought from farmers. Without their explicit consent, including an adequate explanation of how their data could be used (such as for targeted marketing), the farmers’ data should not be shared.
3. How will the data be used? How will this affect the poor?
In Kenya, for example, the use of big data to assess creditworthiness and extend loans has been growing rapidly, with mixed outcomes. Many customers have benefited from increased access to credit.
At the same time, however, credit offers are often promoted through blast SMS messages that tempt customers to take out loans they don’t need, with poor disclosure of costs and fees and with annual percentage rates in excess of 200 percent. Without proper safeguards, this easy access to credit, which the consumer may or may not have sought in the first place, can lead to a cycle of debt.
For smallholder farmers, it is common to need loans to buy fertilizer and seeds at the beginning of the season that can then be repaid several months later after harvest. Big data-driven digital loans, as currently offered, would not meet this need—the digital loans are typically small and have short repayment periods (seven to 30 days).
Yet, once data is publicly available to banks or other providers, there is a temptation to push a product even where there is a mismatch, leaving the borrower exposed to unneeded debts and default. Because of this, calls to make the data open could put farmers’ sensitive information at risk, without providing adequate benefits to make that a worthwhile tradeoff.
Not All Data is the Same
The push to make data open—prominently supported by the World Bank and the U.S. Agency for International Development (USAID), among others—has had an incredible impact on the quantity of data available for development-oriented research and for understanding the lives of the poor.
However, not all data is the same. To date, many efforts to make data open, such as those by the World Bank and USAID, relate to survey data (e.g., data collected via door-to-door household interviews, or as part of a project or program evaluation). This type of data has usually been collected as a public good (e.g., with public money), and can be anonymized before being made “open.”
These factors are generally not true of big data, which is why we must be careful when deciding whether to make it open. One approach to big data is to put individuals in control of their own data. This could be through an opt-in option (see point No. 2 above) or through more direct control by the individual of the data gathered.
For example, a Nigerian farmer might provide a lender the information necessary to secure a loan (such as an e-wallet statement showing transaction history), while not opting in to the public release of his/her data, thus keeping it out of the public domain.
Another approach would be to segment information into three buckets—(Green) could safely be made public; (Yellow) if aggregated or sufficiently anonymized may be made public; and (Red) should not be made public. For example:
Segmenting big data

Red (not to be made open):
  • Names
  • Phone numbers
  • GPS coordinates
  • Mobile money transaction records
  • Identification numbers (e.g. SSNs)

Yellow (possibly made open with appropriate precautions):
  • CDRs, with appropriate safeguards and consent

Green (can be made open):
  • Aggregate figures (60 percent of District X uses mobile money)

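For teams that manage data programmatically, the three buckets could be encoded as a simple release filter. The sketch below is one possible reading of the segmentation, not a prescribed implementation: the field names are hypothetical, and treating any unlisted field as Red is an added assumption chosen so that new fields stay private until someone deliberately classifies them.

```python
# Hypothetical field names; the bucket assignments mirror the post's examples.
BUCKETS = {
    "name": "red",
    "phone_number": "red",
    "gps_coordinates": "red",
    "mobile_money_transactions": "red",
    "id_number": "red",
    "call_data_records": "yellow",
    "district_mobile_money_share": "green",
}


def releasable_fields(fields, yellow_safeguards_met=False):
    """Return the fields that may be opened.

    Green fields are always allowed. Yellow fields are allowed only when the
    caller confirms safeguards and consent are in place. Red fields, and any
    field missing from BUCKETS (an assumption of this sketch), are never released.
    """
    allowed = {"green"}
    if yellow_safeguards_met:
        allowed.add("yellow")
    return [field for field in fields if BUCKETS.get(field, "red") in allowed]


fields = ["name", "call_data_records", "district_mobile_money_share", "farm_size"]
default_release = releasable_fields(fields)
with_safeguards = releasable_fields(fields, yellow_safeguards_met=True)
print(default_release)
print(with_safeguards)
```

By default only the aggregate figure passes the filter. The call data records pass only when safeguards are explicitly confirmed, and the unclassified `farm_size` field is held back in both cases.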
Big Data as Open Data is a Two-Edged Sword
There is much that can be learned from big data that can benefit marginalized populations such as smallholder farmers. At the same time, poorly conceived data releases that enable access to individuals’ information can lead to misuse of that information to the detriment of the individual.
In light of often inadequate consumer protection regulations regarding data privacy, the burden in many cases falls on other stakeholders—companies or governments that hold or control data, donors and funders who hold influence, implementing organizations, and others involved in decision-making—to decide which data to share and when.
This post was originally published as Should Big Data Be Open Data? on the [email protected] Blog