Elias Zavitsanos, G. Paliouras, G. Vouros
Web Intell. Agent Syst.
Abstract. This paper proposes a method for learning ontologies given a corpus of text documents. The method identifies concepts in documents and organizes them into a subsumption hierarchy, without presupposing the existence of a seed ontology. The method uncovers latent topics for generating document text. The discovered topics form the concepts of the new ontology. Concept discovery is done in a language neutral way, using probabilistic space reduction techniques over the original term space of the corpus. Furthermore, the proposed method constructs a subsumption hierarchy of the concepts by performing conditional independence tests among pairs of latent topics, given a third one. The paper provides experimental results on the Genia and the Lonely Planet corpora from the domains of molecular biology and tourism respectively.

Keywords: Ontology Learning, Concept Discovery, Subsumption Hierarchy Construction, Latent Dirichlet Allocation, Conditional Independence

1. Introduction

Ontologies have been proposed as the key element to shape, manage and further process knowledge. However, the engineering of ontologies is a costly, time-consuming and error-prone task when done manually. Furthermore, in quickly evolving domains of knowledge, or in cases where information is constantly being updated, possibly making prior knowledge obsolete, the continuous maintenance and evolution of ontologies are tasks that require significant human effort. Thus, there is a strong need to automate the ontology development/maintenance tasks in order to minimize the cost of ontology creation and evolution.

For this reason, ontology learning has emerged as a field of research, aiming to help knowledge engineers to build and further extend ontologies with the help of automated or semi-automated machine learning techniques, exploiting several sources of information.
Ontology learning is commonly viewed as the task of extending or enriching an existing ontology with new ontology elements mined from text corpora. Depending on the ontology elements being discovered, existing approaches deal with the identification of concepts, subsumption relations among concepts, instances of concepts, or concept properties/relations. Linguistic, statistical, or machine learning techniques are used for these tasks.

The seed ontology used in ontology enrichment is usually a hierarchical backbone of concepts, related via subsumption relations, or a generic ontology that formalizes some of the concepts in a document collection. Linguistic approaches additionally suffer from language dependence, as they rely on language-specific lexico-syntactic patterns.

In contrast to the majority of the existing work, this paper proposes an automated approach to ontology learning, without presupposing the existence of a seed ontology, or any other type of external resource, except the corpus of training text documents. The proposed