A Short Survey of Document Structure Similarity Algorithms

Author: D. Buttler
Publisher:
Total Pages: 9
Release: 2004
Genre:
ISBN:

This paper provides a brief survey of document structural similarity algorithms, including the optimal Tree Edit Distance algorithm and various approximation algorithms. The approximation algorithms include the simple weighted tag similarity algorithm, Fourier transforms of the structure, and a new application of the shingle technique to structural similarity. We show three surprising results. First, the Fourier transform technique proves to be the least accurate of the approximation algorithms, while also being the slowest. Second, optimal Tree Edit Distance algorithms may not be the best technique for clustering pages from different sites. Third, the simplest approximation to structure may be the most effective and efficient mechanism for many applications.
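
As a rough illustration of the shingle idea mentioned in the abstract, the sketch below reduces a page to its sequence of opening tags, cuts that sequence into overlapping windows (shingles), and compares two pages by the Jaccard overlap of their shingle sets. This is not the paper's implementation: the function names, the window width of 3, and the use of Jaccard overlap are all assumptions chosen for the example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect opening tag names in document order (text content is ignored)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def shingles(tags, width=3):
    """Return the set of contiguous tag windows of the given width.

    Sequences shorter than the window yield a single shingle (assumed choice).
    """
    if len(tags) < width:
        return {tuple(tags)} if tags else set()
    return {tuple(tags[i:i + width]) for i in range(len(tags) - width + 1)}


def structural_similarity(html_a, html_b, width=3):
    """Jaccard overlap of the two pages' shingle sets, in the range [0, 1]."""
    a = shingles(tag_sequence(html_a), width)
    b = shingles(tag_sequence(html_b), width)
    if not a and not b:
        return 1.0  # two tag-free documents are treated as identical (assumption)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    list_page = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"
    other_list = "<html><body><ul><li>x</li><li>y</li></ul></body></html>"
    table_page = "<html><body><table><tr><td>x</td></tr></table></body></html>"
    print(structural_similarity(list_page, other_list))
    print(structural_similarity(list_page, table_page))
```

Because only tag names are compared, two pages with the same markup skeleton but different text score as identical, which is the sense in which such a measure captures structure rather than content.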

Structure and Content Semantic Similarity Detection of EXtensible Markup Language Documents Using Keys

Author: Waraporn Viyanon
Publisher:
Total Pages: 246
Release: 2010
Genre: XML (Document markup language)
ISBN:

"XML (eXtensible Mark-up Language) has become the fundamental standard for efficient data management and exchange. Due to the widespread use of XML for describing and exchanging data on the web, XML-based comparison is a central issue in database management and information retrieval. In fact, although many heterogeneous XML sources have similar content, they may be described using different tag names and structures. This work proposes a series of algorithms for detection of structural and content changes among XML data. The first is an algorithm called XDoI (XML Data Integration Based on Content and Structure Similarity Using Keys) that clusters XML documents into subtrees using leaf-node parents as clustering points. This algorithm matches subtrees using the key concept and compares unmatched subtrees for similarities in both content and structure. The experimental results show that this approach finds much more accurate matches with or without the presence of keys in the subtrees. A second algorithm proposed here is called XDI-CSSK (a system for detecting XML similarity in content and structure using relational database); it eliminates unnecessary clustering points using instance statistics and a taxonomic analyzer. As the number of subtrees to be compared is reduced, the overall execution time is reduced dramatically. Semantic similarity plays a crucial role in precise computational similarity measures. A third algorithm, called XML-SIM (structure and content semantic similarity detection using keys), is based on previous work to detect XML semantic similarity based on structure and content. This algorithm is an improvement over XDI-CSSK and XDoI in that it determines content similarity based on semantic structural similarity. In an experimental evaluation, it outperformed previous approaches in terms of both execution time and false positive rates.
Information changes periodically; therefore, it is important to be able to detect changes among different versions of an XML document and use that information to identify semantic similarities. Finally, this work introduces an approach to detect XML similarity and thus to join XML document versions using a change detection mechanism. In this approach, subtree keys still play an important role in order to avoid unnecessary subtree comparisons within multiple versions of the same document. Real data sets from bibliographic domains demonstrate the effectiveness of all these algorithms"--Abstract, leaves iv-v.

Document Analysis Systems

Author: Xiang Bai
Publisher: Springer Nature
Total Pages: 588
Release: 2020-08-14
Genre: Computers
ISBN: 3030570584

This book constitutes the refereed proceedings of the 14th IAPR International Workshop on Document Analysis Systems, DAS 2020, held in Wuhan, China, in July 2020. The 40 full papers presented in this book were carefully reviewed and selected from 57 submissions. The papers are grouped in the following topical sections: character and text recognition; document image processing; segmentation and layout analysis; word embedding and spotting; text detection; and font design and classification. Due to the Corona pandemic, the conference was held as a virtual event.

Introduction to Information Retrieval

Author: Christopher D. Manning
Publisher: Cambridge University Press
Total Pages:
Release: 2008-07-07
Genre: Computers
ISBN: 1139472100

Class-tested and coherent, this textbook teaches classical and web information retrieval, including web search and the related areas of text classification and text clustering from basic concepts. It gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it perfect for introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured in order to make teaching more natural and effective. Slides and additional exercises (with solutions for lecturers) are also available through the book's supporting website to help course instructors prepare their lectures.

Fundamentals of Predictive Text Mining

Author: Sholom M. Weiss
Publisher: Springer
Total Pages: 249
Release: 2015-09-07
Genre: Computers
ISBN: 1447167503

This successful textbook on predictive text mining offers a unified perspective on a rapidly evolving field, integrating topics spanning the varied disciplines of data science, machine learning, databases, and computational linguistics. Serving also as a practical guide, this unique book provides helpful advice illustrated by examples and case studies. This highly anticipated second edition has been thoroughly revised and expanded with new material on deep learning, graph models, mining social media, errors and pitfalls in big data evaluation, Twitter sentiment analysis, and dependency parsing discussion. The fully updated content also features in-depth discussions on issues of document classification, information retrieval, clustering and organizing documents, information extraction, web-based data-sourcing, and prediction and evaluation. Features: includes chapter summaries and exercises; explores the application of each method; provides several case studies; contains links to free text-mining software.

Similarity-Based Pattern Analysis and Recognition

Author: Marcello Pelillo
Publisher: Springer Science & Business Media
Total Pages: 293
Release: 2013-11-26
Genre: Computers
ISBN: 1447156285

This accessible text/reference presents a coherent overview of the emerging field of non-Euclidean similarity learning. The book presents a broad range of perspectives on similarity-based pattern analysis and recognition methods, from purely theoretical challenges to practical, real-world applications. The coverage includes both supervised and unsupervised learning paradigms, as well as generative and discriminative models. Topics and features: explores the origination and causes of non-Euclidean (dis)similarity measures, and how they influence the performance of traditional classification algorithms; reviews similarity measures for non-vectorial data, considering both a “kernel tailoring” approach and a strategy for learning similarities directly from training data; describes various methods for “structure-preserving” embeddings of structured data; formulates classical pattern recognition problems from a purely game-theoretic perspective; examines two large-scale biomedical imaging applications.

Advances in Information Retrieval

Author: Fabio Crestani
Publisher: Springer
Total Pages: 376
Release: 2003-07-31
Genre: Computers
ISBN: 3540458867

The annual colloquium on information retrieval research provides an opportunity for both new and established researchers to present papers describing work in progress or final results. This colloquium was established by the BCS IRSG (British Computer Society Information Retrieval Specialist Group), and named the Annual Colloquium on Information Retrieval Research. Recently, the location of the colloquium has alternated between the United Kingdom and continental Europe. To reflect the growing European orientation of the event, the colloquium was renamed “European Annual Colloquium on Information Retrieval Research” from 2001. Since the inception of the colloquium in 1979 the event has been hosted in the city of Glasgow on four separate occasions. However, this was the first time that the organization of the colloquium had been jointly undertaken by three separate computer and information science departments; an indication of the collaborative nature and diversity of IR research within the universities of the West of Scotland. The organizers of ECIR 2002 saw a sharp increase in the number of good-quality submissions in answer to the call for papers over previous years and as such 52 submitted papers were each allocated 3 members of the program committee for double blind review of the manuscripts. A total of 23 papers were eventually selected for oral presentation at the colloquium in Glasgow which gave an acceptance rate of less than 45% and ensured a very high standard of the papers presented.