IMPROVING UTILITY IN PRIVACY PRESERVING PUBLISHING OF RELATIONAL AND SOCIAL NETWORK DATA

IMPROVING UTILITY IN PRIVACY PRESERVING PUBLISHING OF RELATIONAL AND SOCIAL NETWORK DATA

Adusumilli Sri Krishna

The digitization of our everyday lives has led to an explosion in the collection of individual data by various public, private and non-profit organizations. These organizations publish the data over internet to extend the accessibility to the out- side world, research and public for the purpose of data analysis. Privacy is a double edged sword, there should be adequate privacy to make sure that sensitive information about the individuals is not disclosed by the published data and at the same time there should be meaningful data to perform the data analysis.

With more attention being given to privacy, researchers are attracted to de- sign new privacy models and anonymization algorithms for privacy preserving data publication. Besides accomplishing the privacy guarantee as defined by the privacy model, data utility is another basic requirement that any anonymization method should satisfy. It is noteworthy to ensure that the changes done at anonymization process does not influence the quality of the data analysis. In spite of several efforts in current research works still many outstanding issues remain to be solved.

This thesis aims to contribute a few techniques for data anonymization while optimising the privacy-utility tradeoff. Specifically, the proposed models deal with emerging privacy issues in relational data and social network data. These two types of data are derived from activities of our everyday life, thus the privacy of such data is important. The main objective of this thesis work is divided into three research problems.

First research problem, addresses the issues involved in creating concept hier- archy trees. Concept hierarchy trees are models created with the help of domain experts and are used in data anonymization. The manual construction of concept hierarchy trees might cause problems such as erroneous classifications or omissions of concepts. This research proposes an approach to construct the concept hierarchy tree dynamically by considering the normalized distance between the nodes and frequency of the attribute values. This method produces better concept hierarchies for both generalization and suppression for anonymization process which significantly improved the utility of the anonymized data when compared to existing methods.

Performance evaluation is done for the proposed method Dynamic Concept Hierarchy Tree (DCHT) against known methods, k -member clustering anonymization and Mondrian multi-dimensional algorithm using 1) Improved on the fly hierarchy (IOTF) 2) On the fly hierarchy (OTF) 3) Hierarchy free (HF) 4) Predefined hierarchy (PH) 5) CHU 6) HAN methods. The metrics used for evaluation are a) Information loss, b) Discernibility metric, c) Normalized average equivalence size metric. Experimental results indicate that DCHT approach is more effective and flexible and the utility is 12% better than IOTF, 16% better than OTF & CHU, 17% better than PH and 21% better than HAN methods when applied on Mondrain multi-dimensional algorithm.

Second research problem addresses the problems encountered in the process of k-anonymization. Most of these were found to be performing over-generalization because full-domain and sub-tree generalization techniques were used. This thesis overcomes over-generalization by defining priority for each quasi-identifer attribute based on concept hierarchy tree and their value distribution in the given data. It balances the privacy and utility tradeoff in data anonymization process by em- ploying cell level generalization or local recoding scheme.

The proposed Priority Based Local Recoding (PBLR) algorithm gives the lowest information loss for all k values and always produces equivalence classes with sizes very close to the specified k compared to Incognito algorithm. The information loss of the PBLR algorithm was found to be 3.73 times lower than that of the Incognito algorithm.

Third research problem, focuses on sensitive attribute disclosure in social net- work data publication. The aim is to propose a new privacy model that provides adequate privacy protection and retains enough data utility in social network data publication. Specifically, this proposed (α,k)-Safe anonymity model, protects both identity and sensitive attribute disclosure in the context of different background knowledges of an adversary. To achieve this (α, k)-Safe anonymity Sensitive Safe Anonymization (SESAAN) algorithm is proposed for large scale graph data.

The effectiveness of SESAAN is verified by conducting extensive set of experiment on real data. The performance of SESAAN is measured in terms of number of edges added or deleted for different k-values with α=0.5. The anonymized graph utility is measured in terms of degree distribution, clustering coefficients and average path length.