Protection of personally identifiable information (PII) in big data deployments remains a major concern because current technology designed to protect such information cannot guarantee its safety.
While businesses use PII to serve up targeted ads, products and services (something that’s considered a benefit by some), the exposure of PII can render a person vulnerable to unwarranted scrutiny, potential profiling and discrimination or exclusion based on demographic data, according to an article in the Stanford Law Review.
To protect people from bias and misuse of their personal information, some organizations use de-identification to uncouple information identifying a person from other data associated with that person.
De-identification relies on methods such as anonymization, pseudonymization, encryption, key-coding and data sharding to separate PII from actual identities.
Anonymization involves removing names, addresses and social security numbers; pseudonymization replaces this data with nicknames and other artificial identifiers. Key-coding encodes PII and creates a key for decoding it.
Sharding breaks off parts of the data in a horizontal partition. This method provides just enough data to work with but not enough to identify a person.
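To make the key-coding idea concrete, here is a minimal sketch in Python. It is an illustration only, not a method described by anyone quoted in this article: the function names key_code and decode, the record layout and the choice of a random hex code are all assumptions. Each record's PII fields are swapped for a random code, and the mapping from code back to PII is held in a separate key table.

```python
import secrets


def key_code(records, pii_fields):
    """Replace the PII fields of each record with a random code.

    Returns the coded records and a key table mapping each code back to
    the original PII. The key table is what has to be stored separately
    and protected. (Hypothetical helper, for illustration only.)
    """
    key_table = {}
    coded = []
    for record in records:
        code = secrets.token_hex(8)  # random 16-character identifier
        key_table[code] = {field: record[field] for field in pii_fields}
        stripped = {k: v for k, v in record.items() if k not in pii_fields}
        stripped["code"] = code
        coded.append(stripped)
    return coded, key_table


def decode(coded_record, key_table):
    """Re-attach the original PII to a coded record using the key table."""
    restored = dict(coded_record)
    code = restored.pop("code")
    restored.update(key_table[code])
    return restored
```

Anyone holding only the coded records sees the non-PII fields and an opaque code. The scheme is only as strong as the protection of the key table, and it does nothing about identifying patterns left in the remaining fields.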
However, current de-identification techniques can be countered by re-identification strategies.
Once you have “even one type of data to work with,” according to Keith Carter, adjunct professor at the business school of the National University of Singapore, data can be pieced back together in many ways.
Carter spoke recently at the Big Data World Asia 2013 conference.
For instance, he said, if a business or government is able to get hold of a list of GPS records covering a year, those records could be used to determine the identity of a person.
Organizations can deduce the identity of a person, Carter said, by pinpointing the address that person “regularly come from at seven or eight in the morning.” Researchers will be able to determine if the person went to an office or a school.
From this point, addresses and names could be obtained with a high degree of accuracy using public address tools.
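The first step of the approach Carter describes can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not his actual method: the function name likely_home, the (timestamp, latitude, longitude) record format, the 7-to-8 a.m. window and the rounding precision are all hypothetical choices made for the example.

```python
from collections import Counter
from datetime import datetime


def likely_home(pings, start_hour=7, end_hour=8, precision=3):
    """Guess a home location from nameless GPS records.

    pings is an iterable of (timestamp, latitude, longitude) tuples for one
    device. Coordinates are rounded so that nearby readings fall into the
    same cell, and the cell seen most often during the morning window is
    returned. Returns None if there are no morning readings.
    """
    counts = Counter(
        (round(lat, precision), round(lon, precision))
        for ts, lat, lon in pings
        if start_hour <= ts.hour <= end_hour
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

The sketch stops at a pair of coordinates. Turning that pair into a street address and a name, the step attributed above to public address tools, is not shown.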
Vulnerabilities like these exist because big data systems were never intended to do what they do today, said Brian Christian, chief technology officer for Zettaset Inc., a big data management platform firm.
He said enterprises manage big data systems that are complex and need to execute multiple hand-offs to other systems. Each hand-off is a vulnerable junction, he said.
While de-identification has become a key component of business models in areas such as healthcare, online advertising and cloud computing, it may not be able to provide an adequate solution to big data privacy concerns.
There’s a notion that businesses and governments have spent a lot of money on de-identification to protect PII, Carter said, but what they may actually have accomplished is to provide themselves with a safe harbor.