Hiperstad « Break of Information Overload
      Friday, July 29, 2005

Chern Jie

 
Hiperstad


Improvements and innovations in computer processing power, disk storage and networks have led to dramatic increases in the ability to accumulate and analyse personal data. It is now routine to merge information stored in independent databases, to access and process data online and to use data mining tools for automatic and semi-automatic exploration and pattern discovery. In parallel with these developments there are mounting concerns about the threat to personal privacy, with a call to data collectors and providers to ensure a high level of security, sometimes with the request that the analysis of individual-level data be prevented altogether. However, in some circumstances it can be essential for databases to be considered at the level of the individual in order for any significant benefit to be derived from the investigation. Medical research often requires the use of patient records to gain knowledge about the nature of serious diseases; effective government policy-making can require the analysis of individual responses in a census; customer-level information can be essential to business when making marketing decisions and predictions. A control mechanism that allows sufficient evaluation of personal data while simultaneously protecting the confidentiality of individual records must therefore be applied.

If personal data is made available, even in an anonymised form, there is a risk of individuals being identified through statistical disclosure, that is, the matching of known information with anonymised data, resulting in material specific to those individuals being revealed. This can occur through the actions of a database intruder, the unscrupulous behaviour of a researcher, or in many other ways [9]. The problem of preventing statistical disclosure is approached first by estimating the probability of a certain individual being identified (the risk of disclosure) and then by applying statistical disclosure control (SDC), variously recoding, masking and perturbing the data in order to reduce the statistical disclosure risk.
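To make the matching risk concrete, here is a minimal sketch of linking an anonymised release to an external, identified list on shared attributes. The data, the field names and the function `one_to_one_matches` are all hypothetical, and the sketch assumes that a one-to-one match on the chosen attributes amounts to an identification, which only holds if the external list covers the relevant population.

```python
from collections import Counter, defaultdict

def one_to_one_matches(anonymised, external, keys):
    """Return {release index: name} where a record matches exactly one identity.

    Assumption: a record counts as identified only if its key values are
    unique in the release AND match exactly one entry in the external list.
    """
    def key_of(rec):
        return tuple(rec[k] for k in keys)

    release_counts = Counter(key_of(r) for r in anonymised)
    names_by_key = defaultdict(list)
    for person in external:
        names_by_key[key_of(person)].append(person["name"])

    matches = {}
    for i, rec in enumerate(anonymised):
        k = key_of(rec)
        if release_counts[k] == 1 and len(names_by_key.get(k, [])) == 1:
            matches[i] = names_by_key[k][0]
    return matches

# Hypothetical anonymised release: names removed, sensitive field kept.
anonymised = [
    {"age": 16, "marital": "widowed", "diagnosis": "A"},
    {"age": 34, "marital": "married", "diagnosis": "B"},
    {"age": 34, "marital": "married", "diagnosis": "C"},
]

# Hypothetical external list that still carries names.
external = [
    {"name": "Person X", "age": 16, "marital": "widowed"},
    {"name": "Person Y", "age": 34, "marital": "married"},
    {"name": "Person Z", "age": 34, "marital": "married"},
]

matches = one_to_one_matches(anonymised, external, ["age", "marital"])
print(matches)
```

Only the first record is linked to a name here; the two 34-year-old married records shield each other because neither is unique on the matching attributes.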

In this proposal we concentrate on the identification of individual records with a high risk of disclosure. The records belonging to certain individuals have a significant chance of being identified because their contents, or attributes, are unique and therefore have the potential to be matched directly with details (including names and addresses) from another database. An illustration of a ‘risky’ record of this type is a sixteen-year-old widow in a population survey. A record can contain more than one such unique pattern and its classification often depends on the number and size of such attribute sets (referred to as uniques) that it contains [10]. The general term for records possessing a high risk of disclosure due to the nature of the uniques that they include is a special unique record [11]. The ability to comprehensively locate and grade such records would lead to more efficient disclosure control of released data, but in order to carry out an exhaustive search of this nature all possible attribute sets must be checked (directly or indirectly) for uniqueness, a process which is combinatorially explosive. The importance of speed also depends on how many times an algorithm has to be applied. SDC algorithms often have to be used repetitively on a dataset as different masking techniques are tested for their efficiency.
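The exhaustive search described above can be illustrated with a brute-force sketch. This is not the method the proposal puts forward; it is a naive baseline, with hypothetical data and a hypothetical function name `minimal_uniques`, that shows where the cost comes from: with m attributes there are 2^m - 1 non-empty attribute sets to check. The sketch keeps only minimal uniques (unique sets with no unique proper subset), which is an assumption about what is worth reporting.

```python
from collections import Counter
from itertools import combinations

def minimal_uniques(records, attributes):
    """For each record index, list the minimal attribute sets on which it is unique.

    Brute force: every non-empty subset of `attributes` is checked, so the
    work grows as 2**len(attributes) - 1 passes over the data.
    """
    found = {i: [] for i in range(len(records))}
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            counts = Counter(tuple(r[a] for a in subset) for r in records)
            for i, r in enumerate(records):
                if counts[tuple(r[a] for a in subset)] != 1:
                    continue  # another record shares these values
                # A superset of an already-found unique is not minimal.
                if any(set(prev) <= set(subset) for prev in found[i]):
                    continue
                found[i].append(subset)
    return found

# Hypothetical survey extract.
records = [
    {"age": "16", "marital": "widowed", "sex": "F"},
    {"age": "16", "marital": "single", "sex": "F"},
    {"age": "40", "marital": "widowed", "sex": "F"},
    {"age": "40", "marital": "single", "sex": "M"},
]
attributes = ["age", "marital", "sex"]

uniques = minimal_uniques(records, attributes)
for index, sets in uniques.items():
    print(index, sets)
```

Even on this toy table the last record is unique on a single attribute, while the others only become unique on pairs. With a few dozen attributes the number of subsets runs into the billions, which is why an exhaustive check of this kind does not scale.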

Existing techniques can find outliers (unusual records) in a database if a generic dissimilarity measure between records can be constructed (such as a distance metric in n-dimensional space where n is the number of attributes per record, or a measure of variation between DNA sequences in biological data) [2]. However, many databases contain categorical variables (such as Marital Status in a population survey or Diagnosis in a medical database) for which generic dissimilarity measures are difficult to derive and for which distance metrics are not relevant. For example, if an attribute for marital status were coded with 1=single, 2=married, 3=divorced, it would be meaningless to state that the dissimilarity between person A (single) and person B (married) would double if person B were instead divorced. In general, a record classified as an outlier does not automatically contain unique patterns which can be directly matched with records in an independent database, and the detection of outliers does not automatically guarantee that all records containing unique attribute patterns are identified. A more focused approach is therefore required.
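The marital status example can be checked with a few lines of code. In this sketch the second coding and the helper names are hypothetical; it simply shows that a numeric distance on category codes changes when the arbitrary labels are permuted, whereas a plain match/mismatch indicator does not. The mismatch indicator is offered only as a contrast and carries no notion of degree.

```python
def coded_distance(a, b, coding):
    """Absolute difference between the integer codes of two categories."""
    return abs(coding[a] - coding[b])

def mismatch(a, b):
    """0 if the categories agree, 1 otherwise; independent of any coding."""
    return 0 if a == b else 1

# The coding used in the text, and an equally valid relabelling (hypothetical).
coding_1 = {"single": 1, "married": 2, "divorced": 3}
coding_2 = {"single": 1, "divorced": 2, "married": 3}

for name, coding in (("coding_1", coding_1), ("coding_2", coding_2)):
    print(
        name,
        "single-married:", coded_distance("single", "married", coding),
        "single-divorced:", coded_distance("single", "divorced", coding),
    )

print("mismatch single-married:", mismatch("single", "married"))
print("mismatch single-divorced:", mismatch("single", "divorced"))
```

Under the first coding a divorced person appears twice as far from a single person as a married one does; under the second the ordering is reversed, although nothing about the people has changed.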

Previous work in SDC has led to the development of algorithms designed to protect the confidentiality of individual records under certain conditions, for example techniques for modifying classifiers so as to protect record-level privacy [12] and techniques for assessing the maximum number of queries that can be made without compromising the confidentiality of a given database [5], but these have not addressed the uniques problem directly. Some areas of SDC research have focused on the location of ‘risky’ records [11, 8, 13, 20], but this work, although theoretically interesting, does not overcome the combinatorially explosive properties of the search for uniques and cannot yet be applied to any given dataset.



posted by Information Overload at 01:15 am


   
