Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection

Springer Science & Business Media

Data matching (also known as record or data linkage, entity resolution, object identification, or field matching) is the task of identifying, matching and merging records that correspond to the same entities from several databases or even within one database. Based on research in various domains including applied statistics, health informatics, data mining, machine learning, artificial intelligence, database management, and digital libraries, significant advances have been achieved over the last decade in all aspects of the data matching process, especially on how to improve the accuracy of data matching, and its scalability to large databases.

Peter Christen’s book is divided into three parts: Part I, “Overview”, introduces the subject by presenting several sample applications and their special challenges, as well as a general overview of a generic data matching process. Part II, “Steps of the Data Matching Process”, then details the main steps of that process: pre-processing, indexing, field and record comparison, classification, and quality evaluation. Finally, Part III, “Further Topics”, deals with specific aspects such as privacy, real-time matching, and matching unstructured data, and briefly describes the main features of many research and open-source systems available today.
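To make these steps concrete, here is a minimal sketch of such a pipeline in Python. It is not code from the book: the sample records, the blocking key, the use of Python's difflib for approximate string comparison, and the 0.85 classification threshold are all illustrative assumptions.

    # Toy data matching pipeline: pre-processing, indexing (blocking),
    # field and record comparison, and threshold-based classification.
    from difflib import SequenceMatcher
    from itertools import combinations

    records = [
        {"id": 1, "name": "Kathryn Smith", "city": "Canberra"},
        {"id": 2, "name": "Katherine Smith", "city": "Canberra"},
        {"id": 3, "name": "John Miller", "city": "Sydney"},
    ]

    def preprocess(rec):
        # Pre-processing: normalize case and whitespace.
        return {k: v.strip().lower() if isinstance(v, str) else v
                for k, v in rec.items()}

    def block_key(rec):
        # Indexing/blocking: only records that share a key are compared,
        # which keeps the number of comparisons manageable.
        return (rec["name"][0], rec["city"])

    def field_sim(a, b):
        # Field comparison: approximate string similarity in [0, 1].
        return SequenceMatcher(None, a, b).ratio()

    clean = [preprocess(r) for r in records]
    blocks = {}
    for rec in clean:
        blocks.setdefault(block_key(rec), []).append(rec)

    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            # Record comparison: average the field similarities.
            score = (field_sim(a["name"], b["name"]) +
                     field_sim(a["city"], b["city"])) / 2
            # Classification: a fixed threshold decides match vs. non-match.
            if score > 0.85:
                matches.append((a["id"], b["id"], round(score, 2)))

    print(matches)  # [(1, 2, 0.93)]

A real system would replace each of these toy choices with the more robust techniques the book covers, for example phonetic encodings for indexing and trained classifiers instead of a fixed threshold.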

By providing the reader with a broad range of data matching concepts and techniques and touching on all aspects of the data matching process, this book helps researchers as well as students specializing in data quality or data matching to familiarize themselves with recent research advances and to identify open research challenges in the area. To this end, each chapter of the book includes a final section that provides pointers to further background and research material. Practitioners will better understand the current state of the art in data matching, as well as the internal workings and limitations of current systems. In particular, they will learn that it is often not feasible to simply deploy an existing off-the-shelf data matching system without substantial adaptation and customization. Such practical considerations are discussed for each of the major steps in the data matching process.

About the author

Peter Christen is Senior Lecturer at the Research School of Computer Science at the Australian National University in Canberra, Australia. His research interests are data mining, with a focus on data matching, and privacy-preserving data sharing and mining. He has published over 50 papers in these areas, and he is the principal developer of the “Febrl” (Freely Extensible Biomedical Record Linkage) open source data cleaning, deduplication and record linkage system.

Additional Information

Publisher
Springer Science & Business Media
Published on
Jul 4, 2012
Pages
272
ISBN
9783642311642
Language
English
Genres
Computers / Computer Vision & Pattern Recognition
Computers / Databases / Data Mining
Computers / Databases / General
Computers / Information Technology
Computers / Intelligence (AI) & Semantics
Computers / Optical Data Processing
Computers / System Administration / Storage & Retrieval
Content Protection
This content is DRM protected.

Reading information

Smartphones and Tablets

Install the Google Play Books app for Android and iPad/iPhone. It syncs automatically with your account and allows you to read online or offline wherever you are.

Laptops and Computers

You can read books purchased on Google Play using your computer's web browser.

eReaders and other devices

To read on e-ink devices like the Sony eReader or Barnes & Noble Nook, you'll need to download a file and transfer it to your device. Please follow the detailed Help center instructions to transfer the files to supported eReaders.
The lives of people all around the world, especially in industrialized nations, continue to be changed by the presence and growth of the Internet. Its influence is felt at scales ranging from private lifestyles to national economies, boosting the pace at which modern information and communication technologies influence personal choices along with business processes and scientific endeavors. In addition to its billions of HTML pages, the Web can now be seen as an open repository of computing resources. These resources provide access to computational services as well as data repositories, through a rapidly growing variety of Web applications and Web services. However, people’s usage of all these resources barely scratches the surface of the possibilities that such richness should offer. One simple reason is that, given the variety of information available and the rate at which it is being extended, it is difficult to keep up with the range of resources relevant to one’s interests. Another reason is that resources are offered in a bewildering variety of formats and styles, so that many resources effectively stand in isolation. This is reminiscent of the challenge of enterprise application integration, familiar to every large organization, be it in commerce, academia or government. The challenge arises because of the accumulation of information and communication systems over decades, typically without the technical provision or political will to make them work together. Thus the exchange of data among those systems is difficult and expensive, and the potential synergetic effects of combining them are never realized.
This book addresses the problems that are encountered, and solutions that have been proposed, when we aim to identify people and to reconstruct populations under conditions where information is scarce, ambiguous, fuzzy and sometimes erroneous.

The process from handwritten registers to a reconstructed digitized population consists of three major phases, reflected in the three main sections of this book. The first phase involves transcribing and digitizing the data while structuring the information in a meaningful and efficient way. In the second phase, records that refer to the same person or group of persons are identified by a process of linkage. In the third and final phase, the information on an individual is combined into a reconstruction of their life course.
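As an illustration of the second, linkage phase, the following Python sketch decides whether two hypothetical register entries with different name spellings refer to the same person; the records, fields, similarity measure, and 0.8 threshold are invented for illustration and are not taken from the book.

    # Link two register entries whose name spellings differ.
    from difflib import SequenceMatcher

    census_1850 = {"name": "Jan Pietersz", "born": 1820}
    census_1860 = {"name": "Jan Pieterszoon", "born": 1820}

    def name_sim(a, b):
        # Approximate similarity in [0, 1]; historical spellings of
        # the same name rarely match exactly.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    sim = name_sim(census_1850["name"], census_1860["name"])

    # Combine fuzzy name evidence with exact evidence from other fields.
    if sim > 0.8 and census_1850["born"] == census_1860["born"]:
        print(f"link (name similarity {sim:.2f})")  # link (name similarity 0.89)
    else:
        print("no link")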

The studies and examples in this book originate from a range of countries, each with its own cultural and administrative characteristics, and span from medieval charters through historical censuses and vital registration to the modern issue of privacy preservation. Despite the diverse places and times addressed, they all deal with the fundamental issues of modeling the reasoning involved in population reconstruction and with the possibilities and limitations of information technology to support this process.

It is thus not a single discipline that is involved in such an endeavor. Historians, social scientists, and linguists represent the humanities through their knowledge of the complexity of the past, the limitations of sources, and the possible interpretations of information. The availability of big data from digitized archives and the need for complex analyses to identify individuals call for the involvement of computer scientists. With contributions from all these fields, often in direct cooperation, this book is at the heart of the digital humanities, and will hopefully offer a source of inspiration for future investigations.
"This library is useful for practitioners, and is an excellent tool for those entering the field: it is a set of computer vision algorithms that work as advertised."-William T. Freeman, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

Learning OpenCV puts you in the middle of the rapidly expanding field of computer vision. Written by the creators of the free open source OpenCV library, this book introduces you to computer vision and demonstrates how you can quickly build applications that enable computers to "see" and make decisions based on that data.

Computer vision is everywhere: in security systems, manufacturing inspection systems, medical image analysis, unmanned aerial vehicles, and more. It stitches Google Maps and Google Earth together, checks the pixels on LCD screens, and makes sure the stitches in your shirt are sewn properly. OpenCV provides an easy-to-use computer vision framework and a comprehensive library with more than 500 functions that can run vision code in real time.

Learning OpenCV will teach any developer or hobbyist to use the framework quickly with the help of hands-on exercises in each chapter. This book includes:
- A thorough introduction to OpenCV
- Getting input from cameras
- Transforming images
- Segmenting images and shape matching
- Pattern recognition, including face detection
- Tracking and motion in 2 and 3 dimensions
- 3D reconstruction from stereo vision
- Machine learning algorithms
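For a flavor of what such applications look like, here is a minimal face detection sketch using OpenCV's Python bindings (the book itself works in C/C++); the image path is a placeholder and the detector parameters are common defaults, not values from the book.

    # Minimal face detection with OpenCV's Python bindings
    # (pip install opencv-python). "photo.jpg" is a placeholder path.
    import cv2

    # Load the frontal-face Haar cascade that ships with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    image = cv2.imread("photo.jpg")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Detect faces at multiple scales; returns (x, y, w, h) rectangles.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in faces:
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imwrite("faces.jpg", image)
    print(f"Detected {len(faces)} face(s)")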

Getting machines to see is a challenging but entertaining goal. Whether you want to build simple or sophisticated vision applications, Learning OpenCV is the book you need to get started.
