Concept Discovery from Text via Knowledge Transfer

A better way for systems to organize, file, or index documents and content according to actual or anticipated information needs, expressed as a user query or natural language question.

The Need

Data processing and IT-related activities, ranging from web hosting to automated data entry services, are more important than ever because of the large amounts of data collected through technology. According to IBISWorld, "Companies will increasingly capture more data, requiring the outside expertise of industry operators to manage their data needs."

More specifically, as companies share, search, and collect ever larger amounts of data, it becomes prohibitively expensive to hire workers to search, classify, and manually annotate or index the documents and data they need to stay competitive in the global marketplace.

The Technology

Dr. Das and her colleagues at OSU have developed an improved way for systems such as search engines, manual indexes, question-answer knowledge bases, and in-memory indexes to make associations between related concepts, either ahead of time in anticipation of future needs or in response to current ones. This allows more accurate or more diverse retrieval of items based on context. The technology leverages several novel ideas:

1) A novel use of a neural language modeling approach that leverages shared context between documents within a collection via phrase-based embeddings.

2) A fully unsupervised approach: training uses no outside sources of knowledge and relies instead on the shared contexts within the document collection itself, captured via word and phrasal embeddings. This mimics a human who reads through the documents in the collection and uses what was seen to make relevant concept tag judgments on unseen documents.

3) A black-box mechanism for tagging any corpus of documents with meaningful concepts, treating the corpus as a closed system. The concept associations can therefore be pre-computed offline, or recomputed periodically as new documents are added to the collection.
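
To make the idea of closed-system concept tagging concrete, the following Python sketch tags an unseen document with concept phrases drawn only from the collection itself. It is an illustration under stated assumptions, not the OSU method: simple co-occurrence count vectors stand in for the neural phrase embeddings described above, and every name in it (build_phrase_vectors, tag_document, the sample corpus) is hypothetical.

```python
import math
from collections import Counter, defaultdict

# Assumption: a tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "in", "for", "from", "by", "and", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [t for t in tokens if t]


def content_words(tokens):
    """Drop stopwords, keeping the words that carry context."""
    return [t for t in tokens if t not in STOPWORDS]


def bigrams(tokens):
    """Adjacent word pairs containing no stopword (candidate concept phrases)."""
    return {
        f"{a} {b}"
        for a, b in zip(tokens, tokens[1:])
        if a not in STOPWORDS and b not in STOPWORDS
    }


def build_phrase_vectors(corpus, min_df=2):
    """Pre-compute one context vector per phrase shared by >= min_df documents.

    The vector counts the content words of every document that contains the
    phrase. This is a stand-in for a learned phrase embedding, and it uses
    no knowledge from outside the collection.
    """
    doc_tokens = [tokenize(doc) for doc in corpus]
    phrase_docs = defaultdict(list)
    for i, tokens in enumerate(doc_tokens):
        for phrase in bigrams(tokens):
            phrase_docs[phrase].append(i)

    vectors = {}
    for phrase, doc_ids in phrase_docs.items():
        if len(doc_ids) >= min_df:
            context = Counter()
            for i in doc_ids:
                context.update(content_words(doc_tokens[i]))
            vectors[phrase] = context
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def tag_document(text, vectors, top_k=3):
    """Rank pre-computed concept phrases by similarity to an unseen document."""
    doc_vec = Counter(content_words(tokenize(text)))
    scored = [(p, cosine(doc_vec, vec)) for p, vec in vectors.items()]
    scored = [(p, s) for p, s in scored if s > 0]
    scored.sort(key=lambda ps: (-ps[1], ps[0]))
    return scored[:top_k]


corpus = [
    "neural networks learn word embeddings from text",
    "word embeddings capture meaning in text collections",
    "search engines rank documents for a user query",
    "a user query retrieves documents from search engines",
]
unseen = "embeddings of words learned from text by networks"

vectors = build_phrase_vectors(corpus)  # can be computed offline
tags = tag_document(unseen, vectors)    # top tag: "word embeddings"
```

The unseen document never contains the phrase "word embeddings" verbatim; it receives that tag because it shares context words with the collection documents that do. Since the phrase vectors depend only on the collection, they can be rebuilt offline whenever new documents are added.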

Commercial Applications

  • Big Data
  • Search queries
  • Text mining and knowledge discovery


  • Complex relationship capture between entities
  • Concept understanding similar to humans
  • Minimal to no supervision or external knowledge resources required

Research Interests

Dr. Das received her Ph.D. from Ohio State University. Her research explores unsupervised and semi-supervised learning methods, including neural models, for the effective transfer of knowledge in extracting key concepts. These concepts can enable "discovery" in downstream recommendation tasks such as search, retrieval, and question answering, using text mining, information extraction, and natural language understanding technologies.
