Privacy Inclined Sequential Publishing of Dynamic Social Network Data

Jyothi Vadisala, Awarded

Social networks are being applied to a wide variety of applications. A social network models the social activities between individuals, which change over time. Data privacy is a major concern for dynamic networks: before network data are released to the public or to third parties such as researchers and analysts who compute statistics or perform deep analysis of the data, privacy must be ensured. As the demand for dynamic social network analysis increases, the privacy issues in sequential publishing of social network data need to be considered. In this context, many different anonymization techniques have been proposed in the literature.

This thesis aims to address the privacy risks in dynamic social networks and to contribute a few techniques for data anonymization while optimising the privacy-utility tradeoff. Specifically, the proposed models deal with emerging privacy issues in the sequential publishing of social network data. Such data are derived from the activities of our everyday life, and hence their privacy is important. The main objective of this thesis work is divided into two research problems.

The first research problem addresses the privacy risks of identity disclosure in sequential releases of a dynamic network. The aim is to propose a privacy model that provides adequate privacy protection while retaining enough graph utility in social network data publication. Specifically, the proposed kw-NMF anonymity model protects against identity disclosure when the adversary's background knowledge is the number of common friends. To achieve kw-NMF anonymity, a greedy Number of Mutual Friend Anonymization (NMFA) algorithm is proposed for large-scale graph data, which anonymizes the graph by adding edges.
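A minimal sketch of the k-NMF condition on a single release is given below, assuming the common reading that every mutual-friend count observed on an edge must be shared by at least k edges; the kw-NMF extension across w sequential releases and the NMFA edge-addition heuristic are not reproduced here. networkx and the karate-club example graph are illustrative choices.

    # Minimal sketch of a k-NMF anonymity check on one graph snapshot (assumed
    # reading: each mutual-friend count must be shared by at least k edges).
    from collections import Counter
    import networkx as nx

    def nmf_sequence(G):
        """Number of mutual friends (common neighbours) for every edge."""
        return [len(list(nx.common_neighbors(G, u, v))) for u, v in G.edges()]

    def is_k_nmf_anonymous(G, k):
        counts = Counter(nmf_sequence(G))
        return all(c >= k for c in counts.values())

    if __name__ == "__main__":
        G = nx.karate_club_graph()           # stand-in for one network release
        print(sorted(Counter(nmf_sequence(G)).items()))
        print("2-NMF anonymous:", is_k_nmf_anonymous(G, 2))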

Performance evaluation is done on both real and synthetic data sets. The metrics used for evaluation are graph modification, average shortest path length and average clustering coefficient, examined for different information histories (w) and different protection-level requirements (k). The experimental results show that the proposed algorithm can retain much of the characteristics of a dynamic network while ensuring the privacy protection.
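The utility metrics named above can be computed directly with networkx, as in the hedged sketch below; "graph modification" is taken here simply as the change in edge count, which is an assumption about the thesis's exact definition.

    # Sketch of the utility metrics: graph modification (assumed = edge-count
    # difference), average clustering coefficient, average shortest path length.
    import networkx as nx

    def utility_report(original, anonymized):
        report = {
            "graph_modification": anonymized.number_of_edges() - original.number_of_edges(),
            "avg_clustering_original": nx.average_clustering(original),
            "avg_clustering_anonymized": nx.average_clustering(anonymized),
        }
        # Average shortest path length is defined only on connected graphs,
        # so fall back to the largest connected component if necessary.
        for name, G in (("original", original), ("anonymized", anonymized)):
            H = G if nx.is_connected(G) else G.subgraph(max(nx.connected_components(G), key=len))
            report[f"avg_path_length_{name}"] = nx.average_shortest_path_length(H)
        return report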

The second research problem focuses on how well anonymization methods preserve the existing communities of the original social network. The k-degree and k-NMF anonymization methods are used to anonymize the original social network, and communities are detected in both the original and the anonymized networks using the Louvain method, a heuristic modularity-optimization algorithm. Community preservation is measured by the Percentage of Naive Community Preservation (%NCP) and the Percentage of Community Preservation at Node Level (%CPNL).
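The sketch below illustrates this comparison pipeline using networkx's built-in Louvain implementation; the best-match Jaccard overlap is an illustrative stand-in for the thesis's exact %NCP and %CPNL formulas, which are not reproduced here.

    # Community-preservation comparison sketch: Louvain communities on original
    # and anonymized graphs, scored by best-match Jaccard overlap (illustrative,
    # not the thesis's %NCP / %CPNL definitions).
    import networkx as nx

    def louvain(G, seed=42):
        return [set(c) for c in nx.community.louvain_communities(G, seed=seed)]

    def best_match_overlap(orig_comms, anon_comms):
        """Average, over original communities, of the best Jaccard match."""
        scores = []
        for c in orig_comms:
            best = max(len(c & d) / len(c | d) for d in anon_comms)
            scores.append(best)
        return 100.0 * sum(scores) / len(scores)   # as a percentage

    # usage sketch (hypothetical loader):
    # orig, anon = load_graph("original.edgelist"), load_graph("anonymized.edgelist")
    # print(best_match_overlap(louvain(orig), louvain(anon)))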

The two anonymized social networks are compared with the original social network. The experimental results show that the k-NMF anonymization method preserves more of the communities of the original social network than the k-degree anonymization method.

Feature Selection and Extraction for Inverse Synthetic Aperture Radar Images for Ship Detection and Classification

B. Mamatha

Ocean surveillance is becoming significant since military coastal activities are becoming more frequent, ship types are increasing fast and unconventional activities happen often. Traditional sea surveillance methods that monitor sea targets only through metric information such as range, aspect, speed and direction do not give proper results for identifying a ship. So there is a requirement to develop automatic ship recognition systems. A few such recognition systems with ship detection software already exist. The challenge now is to quantify detection capability and to design new software systems that better detect ship signatures and classify the ship targets of interest. In this work, feature selection and extraction from Inverse Synthetic Aperture Radar (ISAR) images for ship classification is taken as the research problem. A ship ISAR image is a high-resolution radar image. Radar is a popular sensor because it can acquire target signatures in all conditions and from hundreds of kilometers away. The primary objective is to derive target characteristics, or features, from the target signatures acquired through radar. These features are then used for recognition and classification of targets. ISAR images, acquired through high-resolution radar, are widely used for recognition of sea, air, land and even human targets since they contain useful information about target geometry.

In this work a set of ship ISAR images measured with high-resolution radar is taken as the data set. Sea clutter and noise from nearby terrain hinder obtaining clear ship ISAR images. Hence preprocessing of the ISAR images is the first step, and extraction of ship features with highly discriminative nature and low dimensionality from the noise-free images is the next step in ship classification. The classification problem is solved using neural networks. Different feature extraction techniques are applied to the preprocessed ISAR images. The features computed from the training data set are used to train the classifier, and the features obtained from the test set are used to quantify the classification accuracy of the feature vector. A probabilistic neural network is used as the classifier. A study is carried out to identify feature vectors that classify the unique set of ship ISAR images considered in this work. Various statistical techniques and mathematical transforms are used to derive features from the ship ISAR images. The statistical descriptors and Zernike moments are found to give good classification accuracy.
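Since a probabilistic neural network is essentially a Parzen-window kernel-density classifier, a minimal NumPy sketch of such a classifier is given below; the smoothing parameter and the hypothetical feature matrices are illustrative, and the thesis's exact network configuration is not reproduced.

    # Minimal Parzen-window style Probabilistic Neural Network (PNN) classifier.
    # Feature extraction is assumed to have been done already; X holds one
    # feature vector per image.
    import numpy as np

    class PNN:
        def __init__(self, sigma=0.5):
            self.sigma = sigma          # smoothing parameter of the Gaussian kernel

        def fit(self, X, y):
            self.classes_ = np.unique(y)
            self.patterns_ = {c: X[y == c] for c in self.classes_}
            return self

        def predict(self, X):
            preds = []
            for x in X:
                scores = []
                for c in self.classes_:
                    d2 = np.sum((self.patterns_[c] - x) ** 2, axis=1)
                    scores.append(np.mean(np.exp(-d2 / (2.0 * self.sigma ** 2))))
                preds.append(self.classes_[int(np.argmax(scores))])
            return np.array(preds)

    # usage sketch (hypothetical feature matrices):
    # clf = PNN(sigma=0.3).fit(train_features, train_labels)
    # accuracy = np.mean(clf.predict(test_features) == test_labels)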

ISAR images are wavelet-decomposed up to five levels using the multi-resolution wavelet transform. For each decomposition level, all approximation and detail coefficients are computed and taken as feature vectors, in order to identify the most suitable wavelet coefficients for the unique set of ship ISAR images considered in this work. Segmentation is used to find regions of interest in medical images, to extract border or shape information of ISAR images, or to identify linked homogeneous regions in high-resolution remote sensing image analysis. Keeping in view the significant role of segmentation in image analysis, the above wavelet-decomposed image features are also studied to find the classification accuracy of the ISAR images with segmentation.
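A hedged sketch of the five-level 2-D wavelet decomposition with PyWavelets follows; the wavelet family ('db1') is an assumption, since it is not specified here.

    # Sketch of multilevel 2-D wavelet decomposition feature extraction.
    import numpy as np
    import pywt

    def wavelet_features(image, wavelet="db1", levels=5):
        coeffs = pywt.wavedec2(image, wavelet, level=levels)
        # coeffs = [cA_n, (cH_n, cV_n, cD_n), ..., (cH_1, cV_1, cD_1)]
        features = [np.asarray(coeffs[0]).ravel()]        # approximation band
        for cH, cV, cD in coeffs[1:]:                      # detail bands per level
            features.extend([cH.ravel(), cV.ravel(), cD.ravel()])
        return np.concatenate(features)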

Ship ISAR images are colour images. The moments computed from the individual colour component images R, G, B and the combined colour component images RG, GB, RB of the ISAR images, taken together, are considered as a feature vector. Using the multi-resolution wavelet transform, the ISAR images are decomposed up to five levels. The average wavelet energy values are computed for each of the decomposed images, and these energy values are found to form a good feature vector for ship ISAR image classification.
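The colour-moment and wavelet-energy features can be sketched as below; the particular moments (mean, standard deviation, skewness) and the averaging of the combined colour planes are assumptions made for illustration.

    # Colour moments per plane and average wavelet energy per decomposition band.
    import numpy as np
    import pywt
    from scipy.stats import skew

    def color_moments(rgb_image):
        planes = {"R": rgb_image[..., 0], "G": rgb_image[..., 1], "B": rgb_image[..., 2]}
        planes["RG"] = (planes["R"] + planes["G"]) / 2.0   # combined planes assumed
        planes["GB"] = (planes["G"] + planes["B"]) / 2.0   # to be simple averages
        planes["RB"] = (planes["R"] + planes["B"]) / 2.0
        feats = []
        for p in planes.values():
            v = p.astype(float).ravel()
            feats.extend([v.mean(), v.std(), skew(v)])
        return np.array(feats)

    def wavelet_energies(gray_image, wavelet="db1", levels=5):
        coeffs = pywt.wavedec2(gray_image, wavelet, level=levels)
        bands = [np.asarray(coeffs[0])] + [b for lvl in coeffs[1:] for b in lvl]
        return np.array([np.mean(np.square(band)) for band in bands])  # average energy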

In classification problems, the feature vector plays a significant role since both the computational complexity and the classification accuracy depend on the features that constitute the feature vector and on their number. Not all features in a feature vector contribute equally to the classification of the images, and some may be redundant. A smaller feature vector with high discriminating power is always preferred, so there is a need to select an optimum feature set for the classification problem at hand. Several available optimization techniques can be explored for this selection. In this research work, the ship ISAR image classification problem is studied as a single-objective optimization problem, and the different optimum feature combinations that give satisfactory classification for the considered data set are identified. Optimization techniques such as the genetic algorithm and particle swarm optimization are employed to identify the optimum feature sets in the case of colour moments and wavelet energy levels.
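A compact genetic-algorithm feature-subset selection loop is sketched below; the wrapped classifier, population size and rates are illustrative choices rather than the thesis's settings, and a particle swarm variant would follow the same wrapper pattern.

    # Genetic-algorithm feature-subset selection sketch (wrapper approach).
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def ga_select(X, y, pop=20, gens=30, p_mut=0.05, rng=np.random.default_rng(0)):
        n = X.shape[1]
        population = rng.integers(0, 2, size=(pop, n))     # binary feature masks

        def fitness(mask):
            if mask.sum() == 0:
                return 0.0
            clf = KNeighborsClassifier(n_neighbors=3)      # illustrative classifier
            return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

        for _ in range(gens):
            scores = np.array([fitness(ind) for ind in population])
            order = np.argsort(scores)[::-1]
            parents = population[order[: pop // 2]]        # truncation selection
            children = []
            while len(children) < pop - len(parents):
                a, b = parents[rng.integers(len(parents), size=2)]
                cut = rng.integers(1, n)
                child = np.concatenate([a[:cut], b[cut:]])             # one-point crossover
                child ^= (rng.random(n) < p_mut).astype(child.dtype)   # bit-flip mutation
                children.append(child)
            population = np.vstack([parents, children])
        best = max(population, key=fitness)
        return best.astype(bool)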

Generic Distributed Framework for Cloud Services Marketplace Based on Unified Ontology

Samer Hasan, January 2019, Awarded

Cloud computing is a paradigm for delivering ubiquitous, on-demand resources under a pay-as-you-use financial model. It turns IT services into a utility, like water, electricity, gas and telephony. Cloud service discovery and selection has become a significant challenge because of the exponential growth in the number of cloud service providers. Typically, cloud service providers publish service advertisements and Service Level Agreement (SLA) details in various formats on the Internet. Consumers must explore provider websites using existing search engines such as Google and Yahoo to collect information about all available services and then select the best one manually. Unfortunately, general-purpose search engines are not designed to return a complete yet small set of results that meets the consumer's requirements, which makes the discovery and selection process a tedious task.

This thesis presents a generic distributed framework for a cloud services marketplace to: i) automate the cloud service discovery and selection process, ii) reduce the time and effort of finding cloud services, iii) make service providers more visible to all consumers, iv) create a shared understanding of the cloud service domain and v) improve the overall user experience. In addition, this thesis presents a novel algorithm for cloud service numerical similarity named Percent Distance Similarity (PDS), which is independent of any external values. To overcome the interoperability and vendor lock-in problems, this thesis presents a unified cloud services ontology and models real-life services according to the proposed ontology. Finally, this thesis implements two instances of the generic framework by adopting two different matching algorithms: the first is based on the dominant and recessive attributes algorithm borrowed from gene science, and the second is based on a semantic similarity algorithm and the unified cloud service ontology. This thesis is the first attempt to build a cloud service marketplace where cloud service providers and consumers can trade a cloud service as a utility. It is a global digital market where a cloud consumer can find the solution that matches his/her needs by simply submitting a request. A comparison with existing systems is carried out on real-life cloud services collected from provider websites, based on four parameters: number of matched services, execution time, average score and recall. The semantic approach based on the cloud ontology reduced the execution time by 20% while maintaining the same values for all other parameters. The non-semantic approach reduced the execution time by 57% but showed a lower value for recall.
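As a purely illustrative toy of ontology-backed matching, and not the thesis's PDS or dominant/recessive algorithms, the sketch below normalises provider terms to shared ontology concepts and ranks services by how many requested concepts they cover.

    # Toy ontology-backed service matching (illustrative only): provider terms
    # are mapped to shared concepts, then services are scored against a request.
    SYNONYMS = {                      # tiny stand-in for a unified ontology
        "vm": "compute", "instance": "compute", "server": "compute",
        "object storage": "storage", "blob": "storage", "bucket": "storage",
    }

    def normalise(term):
        return SYNONYMS.get(term.lower(), term.lower())

    def match_score(request, service):
        """Fraction of requested concepts that the service offers."""
        wanted = {normalise(t) for t in request}
        offered = {normalise(t) for t in service["features"]}
        return len(wanted & offered) / len(wanted) if wanted else 0.0

    services = [                       # hypothetical service descriptions
        {"name": "ProviderA-S1", "features": ["VM", "blob"]},
        {"name": "ProviderB-S7", "features": ["object storage"]},
    ]
    request = ["instance", "bucket"]
    ranked = sorted(services, key=lambda s: match_score(request, s), reverse=True)
    print([(s["name"], round(match_score(request, s), 2)) for s in ranked])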

Intelligent Evolutionary Approaches based Query Optimisation Algorithms in Distributed Database Systems

S. V. Lakshmi, January 2019, PhD Awarded

The thesis work focused on minimizing the tradeoffs that exist among accuracy, response time and cost in query optimisation, using genetic algorithms. The performance metric used for the proposed techniques is the average cost of a query execution plan, comprising the join processing cost, the local processing cost and the global communication cost, evaluated with respect to the accuracy of the executed query results.
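A hedged sketch of an evolutionary search over join orders is shown below; the cost model (toy join, selectivity and communication costs) and the mutation-only search are illustrative simplifications, not the thesis's algorithms.

    # Evolutionary search over left-deep join orders with a toy distributed cost
    # model (join processing + communication of intermediate results).
    import random

    RELATION_SIZES = {"R1": 1000, "R2": 500, "R3": 2000, "R4": 800}   # hypothetical
    COMM_COST_PER_TUPLE = 0.01

    def plan_cost(order):
        size, cost = RELATION_SIZES[order[0]], 0.0
        for rel in order[1:]:
            cost += size * 0.001 * RELATION_SIZES[rel]               # toy join cost
            size = max(1, int(size * RELATION_SIZES[rel] * 0.0005))  # toy selectivity
            cost += size * COMM_COST_PER_TUPLE                       # toy shipping cost
        return cost

    def ga_join_order(relations, pop=30, gens=100, rng=random.Random(1)):
        population = [rng.sample(relations, len(relations)) for _ in range(pop)]
        for _ in range(gens):
            population.sort(key=plan_cost)
            survivors = population[: pop // 2]
            children = []
            while len(survivors) + len(children) < pop:
                parent = list(rng.choice(survivors))
                i, j = rng.sample(range(len(parent)), 2)
                parent[i], parent[j] = parent[j], parent[i]          # swap mutation
                children.append(parent)
            population = survivors + children
        return min(population, key=plan_cost)

    print(ga_join_order(list(RELATION_SIZES)))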

A Distributed Community based Social Recommender Approach using Trusted Neighbourhood

G. Satya Keerthi, 2018, PhD Thesis Awarded


Abstract: Recommender Systems (RS) apply knowledge discovery techniques on the fly to make personalized recommendations for information, products or services. These systems, especially the k-nearest neighbour collaborative filtering based ones, are used widely on the web. The tremendous growth in the amount of available information and in the number of visitors to websites in recent years poses some key challenges for recommender systems: producing high-quality recommendations, performing many recommendations per second for millions of users and items, and achieving high coverage in the face of data sparsity. In traditional collaborative filtering systems the amount of processing increases with the number of participants in the system. New recommender system technologies are needed that can quickly produce high-quality recommendations even for very large-scale problems.

To address these issues, community-based content and collaborative filtering techniques are explored in this thesis. The collaborative techniques first analyze co-occurrences between items to identify relationships. These relationships are then used to form communities and to compute recommendations for users based on the active user's target item. Different community detection techniques are examined, and a new distributed community approach is explored to reveal the topological relations in the network.

However, the user-item ratings matrix, which is used as input to the recommendation algorithm, is often highly sparse, leading to unreliable predictions. Recent studies have demonstrated that information from social networks, such as trust statements, can be employed to improve the accuracy of recommendations. However, explicit trust relationships do not exist between most users in many e-commerce applications. In this thesis, a method to identify implicit trust statements by applying a reliability measure is proposed. Finally, the results are experimentally evaluated and compared with traditional recommender approaches. The experimental results show that the community-based algorithms provide dramatically better performance than User-Based Collaborative Filtering (UBCF) and Item-Based Collaborative Filtering (IBCF) algorithms, while providing better quality than the available user-based algorithms.
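The item co-occurrence step behind such community-based collaborative filtering can be sketched as below, with a tiny hypothetical ratings set; the community detection and trust components are not reproduced here.

    # Item co-occurrence sketch: count co-rated item pairs and use items most
    # often co-rated with the active user's target item as candidates.
    from collections import defaultdict
    from itertools import combinations

    ratings = {                       # hypothetical user -> rated items
        "u1": {"i1", "i2", "i3"},
        "u2": {"i2", "i3", "i4"},
        "u3": {"i1", "i3", "i4"},
    }

    cooccurrence = defaultdict(int)
    for items in ratings.values():
        for a, b in combinations(sorted(items), 2):
            cooccurrence[(a, b)] += 1
            cooccurrence[(b, a)] += 1

    def related_items(target, top_n=3):
        scores = {b: c for (a, b), c in cooccurrence.items() if a == target}
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    print(related_items("i3"))        # items most often co-rated with i3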

A VERSION CONTROLLED, SELF-ADJUSTING, AND QUALITY OF SERVICE-AWARE LOAD BALANCED CLUSTER


Awarded 2017

VEERABHADRA RAO CHANDAKANNA

Most of the applications that we use today, like Gmail, Twitter, and Facebook, are cloud enabled and can scale dynamically. Typically, such applications are hosted on a load balanced cluster, with the application hosted on every member (server) of the cluster. The end user uses a virtual address to send his/her request to a load balancer that acts as the entry point to the cluster services. The load balancer assigns the request to one of the cluster members. Since the load balancer can assign a request to any member of the cluster, all members must be consistent to produce correct results. A load balancing algorithm attempts to (i) optimize the throughput, response time or resource utilization, or (ii) prevent overloading any specific server. The services hosted on the cluster members go through many revisions during their life-cycle. Current solutions focus mostly on automating dynamic scaling and largely ignore cluster version management. The key requirements for managing a load balanced cluster are auto provisioning, consistency among the cluster members, a load balancer that can quickly adjust to the changing cluster environment, cluster capacity management, and Quality of Service (QoS) monitoring.

The proposed Self-Adjusting Clustering Framework (SACF) automates cluster version management and enforces consistency among the cluster members. SACF automates the application management tasks (deploying, upgrading, and retiring) executed in a cluster environment. The framework allows users to plug in custom logic to be triggered at various points of managing an application. It enforces every active cluster member to stay synchronized with the cluster at all times, automatically detects inconsistency in a newly started cluster member, and makes the necessary corrections to bring it in line with the current cluster state.

A Sliding window based Self-learning and Adaptive Load balancer (SSAL) that can adapt to both stable and unstable cluster environments is proposed. SSAL logically divides time into fixed-size intervals, assigns requests in batches, and makes corrections based on the observed server performance in each interval. It discovers the initial capabilities of the servers and performs the incremental corrections needed in subsequent intervals. It produces better throughput than current models in both stable and dynamic cluster environments.
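The interval-based correction idea can be sketched as follows; the proportional-share rule is an illustration of the approach, not SSAL's exact algorithm.

    # Interval-based batch assignment sketch: requests in the next interval are
    # distributed in proportion to each server's observed throughput in the
    # previous interval (illustrative correction rule).
    def next_batch_shares(completed_last_interval):
        """completed_last_interval: {server: requests completed in the last window}."""
        total = sum(completed_last_interval.values())
        if total == 0:                               # bootstrap: no history yet
            n = len(completed_last_interval)
            return {s: 1.0 / n for s in completed_last_interval}
        return {s: done / total for s, done in completed_last_interval.items()}

    def assign_batch(batch_size, shares):
        assignment = {s: int(batch_size * share) for s, share in shares.items()}
        # hand any rounding remainder to the currently fastest server
        remainder = batch_size - sum(assignment.values())
        assignment[max(shares, key=shares.get)] += remainder
        return assignment

    shares = next_batch_shares({"s1": 120, "s2": 60, "s3": 20})
    print(assign_batch(100, shares))                 # e.g. {'s1': 60, 's2': 30, 's3': 10}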

A Quality of Service aware and Self-correcting observation based Load Balancer (QSLB) extends SSAL to prevent the single-point-of-failure problem, manage cluster capacity, and support QoS monitoring. The QSLB optimizes throughput and allows (i) redundant QSLBs to collaborate and estimate the cluster members' capabilities, (ii) newly started QSLBs to learn the cluster members' capabilities quickly, (iii) the cluster capacity to be shared among different users in a pre-agreed ratio, and (iv) QoS benchmarks to be specified, the QoS parameters monitored, and changes recommended to meet the set QoS goals. Two cluster capacity estimation models (ARSM and NRPM) to improve the QoS are proposed. Multiple centralized and distributed algorithms to implement the QSLB are proposed and evaluated. In summary, this thesis presents a comprehensive approach for managing load-balanced clusters: SACF provides cluster version management and enforces consistency among the cluster members; the observation based load balancer SSAL produces optimal throughput in both stable and unstable cluster environments; and SSAL is extended as QSLB to prevent a single point of failure, manage the cluster capacity and QoS, and estimate the cluster capacity needed to improve the QoS.

IMPROVING UTILITY IN PRIVACY PRESERVING PUBLISHING OF RELATIONAL AND SOCIAL NETWORK DATA


Adusumilli Sri Krishna

The digitization of our everyday lives has led to an explosion in the collection of individual data by various public, private and non-profit organizations. These organizations publish the data over the internet to extend its accessibility to the outside world, to researchers and to the public for the purpose of data analysis. Privacy is a double-edged sword: there should be adequate privacy to ensure that sensitive information about individuals is not disclosed by the published data, and at the same time the data should remain meaningful enough to support analysis.

With more attention being given to privacy, researchers are drawn to designing new privacy models and anonymization algorithms for privacy preserving data publication. Besides accomplishing the privacy guarantee defined by the privacy model, data utility is another basic requirement that any anonymization method should satisfy. It is important to ensure that the changes made during the anonymization process do not degrade the quality of the subsequent data analysis. In spite of several efforts in current research, many outstanding issues remain to be solved.

This thesis aims to contribute a few techniques for data anonymization while optimising the privacy-utility tradeoff. Specifically, the proposed models deal with emerging privacy issues in relational data and social network data. These two types of data are derived from the activities of our everyday life, and hence their privacy is important. The main objective of this thesis work is divided into three research problems.

The first research problem addresses the issues involved in creating concept hierarchy trees. Concept hierarchy trees are models created with the help of domain experts and are used in data anonymization. Manual construction of concept hierarchy trees can cause problems such as erroneous classifications or omissions of concepts. This research proposes an approach to construct the concept hierarchy tree dynamically by considering the normalized distance between the nodes and the frequency of the attribute values. This method produces better concept hierarchies for both generalization and suppression in the anonymization process, which significantly improves the utility of the anonymized data compared to existing methods.
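Generalization over a concept hierarchy tree can be sketched as below; the hierarchy is a hypothetical example, and the dynamic construction from normalized distances and frequencies proposed in the thesis is not reproduced.

    # Minimal generalization over a concept hierarchy tree (CHT).
    PARENT = {                      # child -> parent edges of a small, hypothetical CHT
        "Hyderabad": "Telangana", "Warangal": "Telangana",
        "Vizag": "Andhra Pradesh", "Telangana": "India", "Andhra Pradesh": "India",
    }

    def generalize(value, levels=1):
        """Replace a value by its ancestor `levels` steps up the hierarchy."""
        for _ in range(levels):
            value = PARENT.get(value, value)   # stop at the root
        return value

    print(generalize("Hyderabad"))         # Telangana
    print(generalize("Hyderabad", 2))      # India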

Performance evaluation of the proposed Dynamic Concept Hierarchy Tree (DCHT) method is done against known methods, k-member clustering anonymization and the Mondrian multi-dimensional algorithm, using 1) Improved on the fly hierarchy (IOTF), 2) On the fly hierarchy (OTF), 3) Hierarchy free (HF), 4) Predefined hierarchy (PH), 5) CHU and 6) HAN methods. The metrics used for evaluation are a) information loss, b) the discernibility metric and c) the normalized average equivalence class size metric. Experimental results indicate that the DCHT approach is more effective and flexible, with utility 12% better than IOTF, 16% better than OTF and CHU, 17% better than PH and 21% better than HAN when applied on the Mondrian multi-dimensional algorithm.

The second research problem addresses the problems encountered in the process of k-anonymization. Most existing methods were found to over-generalize because they use full-domain and sub-tree generalization techniques. This thesis overcomes over-generalization by defining a priority for each quasi-identifier attribute based on its concept hierarchy tree and its value distribution in the given data. It balances the privacy-utility tradeoff in the data anonymization process by employing cell-level generalization, a local recoding scheme.

The proposed Priority Based Local Recoding (PBLR) algorithm gives lower information loss for all k values than the Incognito algorithm and always produces equivalence classes with sizes very close to the specified k. The information loss of the PBLR algorithm was found to be 3.73 times lower than that of the Incognito algorithm.

The third research problem focuses on sensitive attribute disclosure in social network data publication. The aim is to propose a new privacy model that provides adequate privacy protection while retaining enough data utility. Specifically, the proposed (α, k)-Safe anonymity model protects against both identity and sensitive attribute disclosure under different kinds of adversary background knowledge. To achieve (α, k)-Safe anonymity, the Sensitive Safe Anonymization (SESAAN) algorithm is proposed for large-scale graph data.
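A sketch of checking an (α, k)-style condition on equivalence classes follows, using the common reading that every class has at least k members and no sensitive value exceeds the fraction α in any class; the thesis's exact (α, k)-Safe definition for graph data may differ.

    # (alpha, k)-style safety check on grouped sensitive values (assumed reading).
    from collections import Counter

    def is_alpha_k_safe(groups, k, alpha):
        """groups: list of lists of sensitive values, one list per equivalence class."""
        for sensitive_values in groups:
            if len(sensitive_values) < k:
                return False
            counts = Counter(sensitive_values)
            if max(counts.values()) / len(sensitive_values) > alpha:
                return False
        return True

    print(is_alpha_k_safe([["flu", "cancer", "flu"], ["flu", "hiv", "cancer"]], k=3, alpha=0.5))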

The effectiveness of SESAAN is verified by conducting an extensive set of experiments on real data. The performance of SESAAN is measured in terms of the number of edges added or deleted for different k values with α = 0.5. The utility of the anonymized graph is measured in terms of degree distribution, clustering coefficients and average path length.

Cooperative Game Theory Based Privacy Preserving Techniques for Data Publishing


L. S. Chakravarty

Awarded 2015

In recent years, the nature of adversarial behaviour has driven a drastic increase in the need for new mechanisms for Privacy Preserving Data Publishing (PPDP). Meeting this demand is quite challenging since privacy is subject to several constraints such as adversary background knowledge, information loss and utility. For instance, in the record linkage problem the adversary can easily identify an individual's sensitive information, such as age or disease, by linking the published data with other data sources.

Enormous data are being collected on different aspects of people's lives, such as their credit card transaction records, phone call and email lists, personal health information and web browsing habits. Much of this data is scanned for important information for other purposes, such as the evaluation of credit risk. Such analysis of private information often raises concerns regarding the privacy rights of individuals and organizations. Disobedience to privacy protocols and an adversary's background information may lead to privacy attacks. In this regard, the privacy community has proposed several noteworthy principles such as k-Anonymity, l-Diversity and ε-Differential Privacy.

Cooperation among entities such as people and organizations has recently emerged as a paradigm that can yield tremendous performance gains and enable efficient new models for privacy preserving data publishing. This thesis focuses on how Cooperative Game Theory (CGT) helps study how agents behave cooperatively when forming groups in a data privacy context.

To overcome some of the shortcomings of k-Anonymity, this work presents new approaches to achieve k-Anonymity using Concept Hierarchy Trees (CHT) based on CGT. The observed shortcomings in existing works are: i) the anonymization is done based on an assumed value of k, ii) the information loss can be found only after anonymization is done, and iii) if the information loss turns out to be more than the affordable loss, another value of k has to be chosen and the whole anonymization process repeated. To achieve anonymity, coalitions are established between tuples based on their payoffs, which are assigned using the CHT of the Quasi Identifiers (QID). The new system allows the user to control the required information loss as the anonymization metric in data publishing. It also presents a new approach to reduce information loss, and in turn the anonymization level, while achieving k-Anonymity using CHT with weighted levels. Experimental results showing the practicality and scalability are presented.

The Co-Operative Privacy Game (CoPG) is proposed in this work to achieve data privacy. The main objective of the CoPG is to treat each individual's (player's) privacy as a goal while being rationally interested in the other individuals' (players') privacy. The CoPG is formally defined in terms of Nash equilibria, i.e., all players are in their best coalition, to achieve k-anonymity. The Cooperative Value (CoV) of each tuple is measured using the characteristic function of the CoPG to identify the coalitions. As the underlying game is convex, the algorithm is efficient and yields high-quality coalition formation with respect to intensity and dispersion. The variations of the information loss with the parameters β (weight factor of 'nearness') and γ (multiplicity) are analyzed and the obtained results are discussed.
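A toy illustration of grouping tuples into coalitions of size at least k is given below; plain numeric nearness stands in for the CHT-based payoffs, and the CoPG characteristic function itself is not reproduced.

    # Toy coalition formation for k-anonymity: tuples are ordered by a simple
    # numeric attribute (stand-in for CHT-based nearness) and grouped into
    # coalitions of at least k members.
    import numpy as np

    def form_coalitions(values, k):
        order = np.argsort(values)                 # sort so "near" tuples are adjacent
        coalitions, current = [], []
        for idx in order:
            current.append(int(idx))
            if len(current) == k:
                coalitions.append(current)
                current = []
        if current:                                 # fold leftovers into the last coalition
            coalitions[-1].extend(current)
        return coalitions

    ages = np.array([23, 25, 24, 40, 41, 39, 60])
    print(form_coalitions(ages, k=3))               # indices grouped into k-sized coalitions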


GRAPH BASED PARAMETER-LITE CLUSTERING ALGORITHM USING EUCLIDEAN AND MAHALANOBIS DISTANCE METRICS


B.H.V.S. Ramakrishnam Raju

Awarded 2014

Data mining technology has emerged as a means for identifying patterns and trends in large quantities of data. Clustering groups the most similar data together and has a wide range of applications. Clustering algorithms are classified into partition algorithms, hierarchical algorithms, density-based and grid-based algorithms, etc. Partition algorithms are simple and efficient but are sensitive to noise present in the dataset, and their use requires choosing the number of desired output clusters. Hierarchical algorithms can detect clusters with different shapes, but they demand high computational cost. Density-based and grid-based clustering algorithms handle arbitrarily shaped collections of points as well as clusters of different sizes, and can also efficiently separate noise. However, these algorithms are very sensitive to the initial setting of parameters, i.e. the desired number of clusters. Hence the main problem with clustering is to decide the optimal number of clusters that fits a data set.

As a consequence, if the parameters are not set properly, the clustering method may not result in an optimal partitioning scheme and may lead to improper decisions. The problem of deciding the number of clusters that best fits a given dataset has been the subject of this research. To overcome the problems associated with the initial setting of parameters, a novel clustering algorithm named PLMST (Parameter-Lite Minimum Spanning Tree), based on minimum spanning tree (a connected acyclic graph) splitting and a fuzzy-similarity-based merging technique, is proposed in this thesis. The proposed algorithm does not require the number of clusters to be set initially. The algorithm starts by creating the minimum spanning tree for the given dataset, and in each iteration an inconsistent edge is removed until the termination criterion is reached. This gives a rough estimate of the number of clusters present in the dataset. The number of clusters is then fine-tuned by eliminating outlier clusters, which are defined based on the number of data items they contain. Once these clusters are known, the optimal number of clusters is generated using fuzzy-similarity-based merging with minimum effort.
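The MST split stage can be sketched with SciPy as below; repeatedly cutting the longest edge is an illustrative stand-in for the thesis's inconsistent-edge criterion, and the fuzzy-similarity merging stage is not reproduced.

    # MST split sketch: build the minimum spanning tree, cut the longest edges,
    # and read clusters off as connected components.
    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
    from scipy.spatial.distance import squareform, pdist

    def mst_split(X, n_splits):
        dist = squareform(pdist(X))                        # pairwise Euclidean distances
        mst = minimum_spanning_tree(dist).toarray()
        for _ in range(n_splits):                          # cut the longest remaining edge
            i, j = np.unravel_index(np.argmax(mst), mst.shape)
            mst[i, j] = 0.0
        n_clusters, labels = connected_components(mst, directed=False)
        return n_clusters, labels

    X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 8])
    print(mst_split(X, n_splits=1))                        # expect 2 components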

Both artificial (synthetic) and real-life data sets are considered for testing the performance of the proposed algorithm. Synthetic datasets are chosen because they can be designed to contain a known number of clusters and are easy to manage. The 2-D synthetic datasets Data_A and Data_B and a benchmark dataset called Ruspini are used for testing the algorithm. The real datasets tested are Iris, SPECT (Heart), Statlog, Soybean, Breast Tissue and Wine, obtained from the UCI Machine Learning Repository. The parameters required for the proposed algorithm are the number of data points in a cluster used to identify an outlier cluster, and the fuzzifier used for fuzzy clustering. It works on numerical data and assumes that the graph can be computed in a vector space.

Experiments on both synthetic and real datasets demonstrate that the proposed algorithm can discover the patterns automatically and can compete with traditional clustering algorithms with minimal user intervention. The similarity measure used in the proposed algorithm is based on Euclidean distance, which is very sensitive to the scales of the variables involved and does not account for correlations among them. To overcome these drawbacks, a hybrid clustering algorithm based on the Mahalanobis distance is proposed. To evaluate the proposed methodology, the performance of the algorithm is also compared with the results obtained on the same datasets by several other clustering approaches such as standard K-means, FCM, Gustafson-Kessel (GK) and Gath-Geva (GG).
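A short sketch of the Mahalanobis distance, which rescales by the data covariance and thereby handles differing scales and correlated variables, is given below.

    # Mahalanobis distance sketch: distances are measured relative to the
    # inverse covariance of the data, making them scale- and correlation-aware.
    import numpy as np

    def mahalanobis(x, y, VI):
        d = np.asarray(x) - np.asarray(y)
        return float(np.sqrt(d @ VI @ d))

    X = np.random.default_rng(0).multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=500)
    VI = np.linalg.inv(np.cov(X, rowvar=False))            # inverse covariance of the data
    print(mahalanobis(X[0], X[1], VI))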

An important direction for further work is to apply the PLMST algorithm to the more general case, where the number of clusters and the threshold value must also be determined, and where the number of iterations must be controlled.