Taxonomy Embeddings on PubMed Article Subject Headings
Abstract
Machine learning approaches for hierarchical partial-orders, such as taxonomies, are of increasing interest in the research community, though practical applications have not yet emgerged. The basic intuition of hierarchical embeddings is that some signal from taxonomic knowledge can be harnessed in broader machine learning problems; when we learn similarity of words using word embeddings, the similarity of *lion* and *tiger* are indistinguishable from the similarity of *lion* and *animal*. The ability to tease apart these two kinds of similarities in a machine learning setting yields improvements in quality as well as enabling the exploitation of the numerous human-curated taxonomies available across domains, while at the same time improving upon known taxonomic organization problems, such as partial or conditional membership. We explore some of the practical problems in learning taxonomies using bayesian networks, partial order embeddings, and box lattice embeddings, where box containment represents category containment. Using open data from pubmed articles with human assigned MeSH labels, we investigate the impact of taxonomic information, negative sampling, instance sampling, and objective functions to improve performance on the taxonomy learning problem. We discovered a particular problem for learning box embeddings for taxonomies we called the box crossing problem, and developed strategies to overcome it. Finally we make some initial contributions to using taxonomy embeddings to improve another learning problem: inferring disease (anatomical) locations from their use as subject labels in journal articles. In most experiments, after our improvements to box models, the box models outperformed the simpler Bayes Net approach as well as Order Embeddings.