Approaches for Summarization and Identification of Data Sources in the Linked Open Data Cloud

Nasir Naveed, Thomas Gottron


Linked Open Data (LOD) available on the web not only encodes heterogeneous structural relationships between entities of various types but is also a source of textual information associated with these entities. Governments at various levels can benefit from this data when making policy decisions. The goal of this work is to provide semi-automated means of accessing information from open data sources that is relevant to a given policy proposal. This paper reports on an investigation of approaches for the thematic analysis of policy documents and of techniques for identifying and summarizing linked open datasets relevant to the topics of those documents. The work focuses on identifying the required functionality based on end-user needs, developing a linked open data search strategy model, and investigating the techniques required to realize the functionality specified in the model. As background work, we identified machine learning techniques that automatically determine the main themes or topics of a policy document in order to generate search terms for querying open and linked open data repositories. We also investigated approaches for finding the topic connectivity structure in the linked open data cloud, a methodology for topic-diversity-based relevance ranking of datasets, and summarization algorithms for generating dataset summaries.


Linked Open Data (LOD); Data; Data repositories; Semi-automated; Algorithms





© 2002-2014 BUITEMS