College Papers


Advances in technology have affected almost every aspect of life, and one of
the biggest changes is the volume of data collected and stored. Each day, as
we use the Internet, personal information is collected with or without our
consent. As a result, data holders accumulate large amounts of personal data
that can help them understand user behavior and find patterns. However, there
is also a risk that someone can piece together a picture of individuals'
private lives, because data is difficult to anonymize properly. Many
organizations assume that removing the name, address, and telephone number
results in anonymous data, but that is not true: it has been shown that even
without these fields, information can be linked back to the person who
supplied it.

Re-identification is the process by which anonymized personal data is matched
with its true owner. To protect users, personal identifiers, such as the
5-digit ZIP code and date of birth, are typically removed from datasets that
contain sensitive data.

Re-identification by linking consists of using auxiliary information, often
publicly available, to find individuals in an anonymized dataset. The two
datasets usually have at least one type of information in common, and that
shared information links the anonymized records to an individual. Two famous
cases of re-identification are the movie ratings in the Netflix study and the
medical record of the governor of Massachusetts. Technology is advancing
quickly, re-identification techniques included, and as we leave more data
traces online it will become easier to re-identify individuals unless
appropriate measures are taken. One solution could be to require
organizations to analyze a dataset before releasing it to the public, checking
whether datasets already available online can be used to re-identify the
people in it.
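
The linking step described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the records, the field
names (zip, birth_date, sex), and the helper names key and link are all
hypothetical and do not come from the Netflix or Massachusetts cases.

```python
from collections import defaultdict

# Hypothetical "anonymized" release: names removed, other fields kept.
released = [
    {"zip": "02138", "birth_date": "1950-03-14", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1962-11-02", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_date": "1962-11-02", "sex": "M", "diagnosis": "flu"},
]

# Hypothetical public dataset that still carries names.
public = [
    {"name": "Alice Example", "zip": "02138", "birth_date": "1950-03-14", "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_date": "1962-11-02", "sex": "M"},
]

# Fields assumed to be present in both datasets.
SHARED_FIELDS = ("zip", "birth_date", "sex")


def key(record):
    """Build the linking key from the fields both datasets share."""
    return tuple(record[field] for field in SHARED_FIELDS)


def link(released, public):
    """Return (name, released_record) pairs for keys that match
    exactly one record on each side."""
    released_by_key = defaultdict(list)
    for record in released:
        released_by_key[key(record)].append(record)

    public_by_key = defaultdict(list)
    for person in public:
        public_by_key[key(person)].append(person)

    matches = []
    for k, people in public_by_key.items():
        candidates = released_by_key.get(k, [])
        if len(people) == 1 and len(candidates) == 1:
            matches.append((people[0]["name"], candidates[0]))
    return matches


matches = link(released, public)
for name, record in matches:
    print(name, "->", record["diagnosis"])
```

In this toy data only one person is linked to a single record. The second
person matches two released records, so the link is ambiguous, which is the
kind of protection that grouping records together is meant to provide.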

To deal with this problem, the statistical office must determine the risk of
identification for any dataset it intends to publish. If the risk is high,
measures must be taken to protect the data, such as perturbative methods,
which include adding random noise to the data. Such methods introduce more
variance, which in turn reduces the ability to make statistical inferences.
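
As a rough sketch of the two steps above, the Python below first estimates
risk and then perturbs a numeric column. It rests on assumptions that are not
in the text: risk is approximated as the share of records that are unique on
a chosen set of fields, which is only one simple proxy, and the noise is
Gaussian. The function names uniqueness_risk and add_noise and the sample
records are hypothetical.

```python
import random
from collections import Counter


def uniqueness_risk(records, fields):
    """Fraction of records whose combination of `fields` is unique.

    A simple proxy for identification risk, not an official measure.
    """
    if not records:
        return 0.0
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


def add_noise(values, sigma, rng=None):
    """Return a copy of `values` with zero-mean Gaussian noise added.

    A larger `sigma` gives more protection but also more variance,
    so statistics computed on the result become less precise.
    """
    rng = rng or random.Random()
    return [v + rng.gauss(0.0, sigma) for v in values]


# Hypothetical records for illustration only.
records = [
    {"zip": "02138", "age": 34, "income": 52000.0},
    {"zip": "02138", "age": 34, "income": 61000.0},
    {"zip": "02139", "age": 51, "income": 48000.0},
    {"zip": "02140", "age": 29, "income": 75000.0},
]

risk = uniqueness_risk(records, ("zip", "age"))
print("share of unique records:", risk)

if risk >= 0.5:  # threshold chosen arbitrarily for the example
    incomes = [r["income"] for r in records]
    noisy_incomes = add_noise(incomes, sigma=1000.0, rng=random.Random(0))
    print(noisy_incomes)
```

The threshold and the value of sigma are arbitrary here. In practice both
would have to be chosen by weighing the protection gained against the loss of
statistical precision described above.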

As mentioned before, one of the main difficulties is assessing the risks of
publishing an anonymized dataset. It is also necessary to evaluate the
accuracy of the inference attacks that an adversary can perform based on the
released data and other knowledge gathered. In general, the biggest problems
result from inferences that can be drawn after linking the released data to
other knowledge. The inference problem arises whenever some data X can be used
to derive partial or complete information about some other data Y, so this
ability must be controlled. It is impossible to anticipate every possible
attack, however, so we should at least secure the data against known attacks.
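
One known attack of this kind, sometimes called a homogeneity attack, can be
sketched as follows. If every record that shares the same visible attributes
X also shares the same sensitive value Y, then knowing that a person falls in
that group reveals Y without identifying their exact record. The data, the
field names, and the function name homogeneous_groups are hypothetical, chosen
only to illustrate the idea.

```python
from collections import defaultdict


def homogeneous_groups(records, visible_fields, sensitive_field):
    """Map each group of two or more records that all share one
    sensitive value to that value."""
    values_by_group = defaultdict(list)
    for record in records:
        group = tuple(record[f] for f in visible_fields)
        values_by_group[group].append(record[sensitive_field])
    return {
        group: values[0]
        for group, values in values_by_group.items()
        if len(values) >= 2 and len(set(values)) == 1
    }


# Hypothetical release in which ages and ZIP codes were generalized.
released = [
    {"age_range": "30-39", "zip_prefix": "021", "diagnosis": "diabetes"},
    {"age_range": "30-39", "zip_prefix": "021", "diagnosis": "diabetes"},
    {"age_range": "40-49", "zip_prefix": "021", "diagnosis": "flu"},
    {"age_range": "40-49", "zip_prefix": "021", "diagnosis": "asthma"},
]

leaks = homogeneous_groups(released, ("age_range", "zip_prefix"), "diagnosis")
for group, value in leaks.items():
    print(group, "->", value)
```

A check like this could be run before release as one of the tests against
known attacks. It does not cover attacks that rely on outside knowledge the
publisher has not anticipated.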