Research in Information Retrieval and Mining

Jump to...

Social Media and Mental Health
Domain Specific Search & Mining: Mining Social Media for Healthcare
Scientific Summarization
Clinical Decision Support Systems
Categorizing Erros in Clinical Care through Medical Narratives
Information Extraction
Opinion Mining and Sentiment Analyis of User Reviews
Query Session Analysis
Finding Category / Topic Relationships
Passage Detection
Feature Selection in Text Classification
Misuse (Off-Topic Search) Detection in Information Retrieval
Medical Informatics

Social Media and Mental Health

Social media has become a significant resource for improving healthcare and mental health. Users suffering from mental health conditions often turn to online resources for support, such as specialized support communities staffed by moderators who read the users’ posts and flag those posts that indicate a potential risk (e.g., the risk of self-harm). Users who do not participate in online support communities often still participate in more general social media communities, such as Twitter, Facebook, and Reddit. In this project, we explore methods and approaches for better understanding and identifying users with mental health conditions and analyzing user content severity. We propose an approach for triaging user content into four severity categories which are defined based on indication of self-harm ideation. We conduct various analysis on real-world data, providing more insight into addressing the current challenges in mental-health.

A. Cohan, B. Desmet, A. Yates, L. Soldaini, S. MacAvaney and N. Goharian, "SMHD: a Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions", COLING 2018. Selected as Area Chair Favorite.
L. Soldaini, T. Walsh, A. Cohan, J. Han, and N. Goharian, "Helping or Hurting? Predicting Changes in Users’ Risk of Self-Harm Through Online Community Interactions", CLPsych 2018.
S. MacAvaney, B. Desmet, A. Cohan, L. Soldaini, A. Yates, A. Zirikly, and N. Goharian, "RSDD-Time: Temporal Annotation of Self-Reported Mental Health Diagnoses", CLPsych 2018.
A. Yates*, A. Cohan*, Nazli Goharian, "Depression and Self-Harm Risk Assessment in Online Forums", Empirical Methods in Natural Language Processing (EMNLP), 2017. *Equal contribution, EMNLP 2017 best long paper award.
A. Cohan, S. Young, A. Yates, N. Goharian, "Triaging Content Severity in Online Mental-Health Forums", Journal of the Association for Information Science and Technology (JASIST), Special Issue on Biomedical Information Retrieval, Volume 68, Issue 11, November 2017.
A. Cohan, S. Young, and N. Goharian, Triaging Mental Health Forum Posts, In Proceedings of the NAACL HLT 3rd Computational Linguistics and Clinical Psychology - From Linguistic Signal to Clinical Reality Workshop (CLPsych’16). June 2016.

Domain Specific Search & Mining: Mining Social Media for Healthcare

Online discussions of virtually all topics are increasing; this phenomenon is ever more so in the domain of healthcare. Mining social media presents its difficulty due to the lay people / informal language used. Thus, it is a challenge to identify variations of the same concept and map to the expert term in the available knowledge sources. In this project we proposed and evaluated synonym discovery methods and concept extraction methods. One of the important applications is to harness these publicly available statements, to further our knowledge and understanding about drug behavior. We focus on using several drug related and other general social media sites, query analysis, peer-to-peer, and Web sites to detect expected and unexpected adverse reaction to drugs and devices. To understand users intentions, we utilize consumer medical terminology from UMLS and various other approaches to generate an adverse reaction synonym set that we use to identify both expected adverse reactions, as already recorded by the FDA, and unexpected adverse reactions mentioned in online reviews. ADRs Background (drug) language is utilized to evaluate the strength of the detected unexpected ADRs. Existing synonym discovery methods perform poorly when faced with the realistic task of identifying a target term's synonyms from among many candidates. We approach domain-specific synonym discovery as a graded relevance ranking problem in which a target term's synonym candidates are ranked by their quality. In this scenario a human editor uses each ranked list of synonym candidates to build a domain-specific thesaurus. We evaluate our method for graded relevance ranking of synonym candidates and find that it outperforms existing methods. We reduce the impact of incomplete information by learning the relationship between user mentions of symptom,conditions and drugs. Furthermore, we utilize our synonym discovery and concept extraction methods to construct a framework to detect trends. We furthermore studied the effects of twitter sampling on the trend detection, as well as methods to quantify the influence of news cycle on social media activity.

A.Yates, A. Kolcz, N. Goharian and O. Frieder, "Effects of Sampling on Twitter Trend Detection", In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC’16), May 2016.
A. Yates, J. Joselow, N. Goharian, "The news cycle's influence on social media activity", International AAAI Conference on Web and Social Media (ICWSM). May 2016.
A. Yates, N. Goharian, O. Frieder, Learning the relationships between drug, symptom, and medical condition mentions in social media", International AAAI Conference on Web and Social Media (ICWSM). May 2016.
J. Parker, A. Yates, N. Goharian, and O. Frieder, "Health Related Hypothesis Generation using Social Media Data," Journal of Social Network Analysis and Mining, Springer, 2015.
A. Yates, N. Goharian, O. Frieder, "Extracting Adverse Drug Reactions from Social Media", Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), 2015.
A. Yates, J. Parker, N. Goharian, and O. Frieder, A Framework for Public Health Surveillance, In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC.14), May 2014.
A. Yates, N. Goharian, O. Frieder, Relevance-Ranked Domain-Specific Synonym Discovery, in 36th European Conference on Information Retrieval (ECIR), April 2014.
J. Parker, Y. Wei, A. Yates, O. Frieder, N. Goharian, A Framework for Detecting Public Health Trends with Twitter., The 2013 IEEE/ACM International Conference on Advances in Social Network Analysis and Mining, Aug. 2013.
A. Yates, N. Goharian, O. Frieder, .Extracting Adverse Drug Reactions from Forum Posts and Linking them to Drugs., SIGIR Workshop on Health Search and Discovery, July-Aug 2013.
A. Yates, N. Goharian, and O. Frieder, Graded Relevance Ranking for Synonym Discovery, 22nd International Conference on World Wide Web (WWW), May 2013 (short).
A. Yates and N. Goharian, ADRTrace:Detecting Expected and Unexpected Adverse Drug Reactions from User Reviews on Social Media Sites, in 35th European Conference on Information Retrieval (ECIR), 2013. (short)
A. Yates, N. Goharian, Mining Social Media for Healthcare, ICBI Biomedical Informatics Symposium at Georgetown University, Best Poster Award (1 out of 28), Oct 2012.

Scientific Summarization

Due to the expanding rate at which articles are being published in various scientific fields, it has become difficult for researchers to keep up with the new developments. Scientific summarization aims to facilitate this problem. One useful strategy for scientific summarization is citation based summarization in which citations to a reference article are used to generate the summary of the reference paper. While citations have been previously used in generating scientific summaries, they lack the related context from the referenced article and therefore do not accurately reflect the article’s content. Our goal is to overcome this problem by providing the appropriate context for the citations and utilize this information towards extractive summary of the article. We have also shown that using scientific article’s inherent discourse structure can help improving the quality of the generated summaries. We are currently investigating approaches for development of more robust general summarization and scientific summarization routines.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian, "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents", NAACL-HLT 2018.
Arman Cohan, Nazli. Goharian, "Scientific Document Summarization via Citation Contextualization and Scientific Discourse", International Journal of Digital Libraries, Special Issue on Bibliometric-Enhanced Information Retrieval and Natural Language Processing, Springer. May 2017.
Arman Cohan, and Nazli Goharian, "Contextualizing Citations for Scientific Summarization using Word Embeddings and Domain Knowledge", ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), Aug 2017. (short).
A. Cohan and N. Goharian, "Revisiting Summarization Evaluation for Scientific Articles", In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC’16), May 2016.
A. Cohan and N. Goharian, "Scientific Article Summarization Using Citation-Context and Article's Discourse Structure" Empirical Methods for Natural Language Processing (EMNLP) 2015.
A. Cohan, L. Soldaini, and N. Goharian, "Matching Citation Text and Cited Spans in Biomedical Literature: a Search-Oriented Approach," NAACL-HLT, 2015.
A. Cohan, L. Soldaini, S. Mengle, and N. Goharian, "Towards Citation-Based Summarization of Biomedical Literature," Text Analysis Conference (TAC), 2014.

Clinical Decision Support Systems

Keeping current given the vast volume of medical literature published yearly poses a serious challenge for medical professionals. Thus, interest in systems that aid physicians in making clinical decisions is intensifying. We explore and evaluate approaches to retrieve relevant medical literature given a medical case report. Furthermore, given the action a health expert is seeking to complete (make a diagnosis, prescribe a treatment, or order a test), we investigate reranking techniques that could provide more appropriate literature.

L. Soldaini, A. Yates, N. Goharian, "Denoising Clinical Notes for Medical Literature Retrieval with Convolutional Neural Model", Proceedings of the 26th ACM International Conference on Information and Knowledge Management (CIKM), 2017.
L. Soldaini, A. Yates, N. Goharian, "Learning to Reformulate Long Queries for Clinical Decision Support", Journal of the Association for Information Science and Technology (JASIST), Special Issue on Biomedical Information Retrieval, Volume 68, Issue 11, November 2017.
L. Soldaini and N. Goharian, "QuickUMLS: a fast, unsupervised approach for medical concept extraction", In Proceedings of the Medical Information Retrieval (MedIR) workshop at SIGIR 2016, July 2016.
L. Soldaini, A. Cohan, A. Yates, N. Goharian, O. Frieder, "Retrieving Medical Literature for Clinical Decision Support", Proceedings of the 37th European Conference on Information Retrieval (ECIR 2015), 2015.
L. Soldaini, A. Cohan, A. Yates, N. Goharian, O. Frieder, "Query Reformulation for Clinical Decision Support Search", Proceedings of the 23rd Text REtrieval Conference Proceedings (TREC), 2015.
A. Cohan, L. Soldaini, A. Yates, N. Goharian, and O. Frieder, "On clinical decision support", Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 2014. (short)

Categorizing Erros in Clinical Care through Medical Narratives

There is an increasing demand for use of electronic health records and clinical texts, for reasons such as improving health care, public health surveillance, quality measures, and improving medical education. Text categorization and classification is a fundamental task in understanding, mining, and analyzing medical text and it can benefit applications that improve healthcare in general. In this project, we focus on developing text categorization methods to address some of the real-world challenges in healthcare.

Arman Cohan, Allan Fong, Raj Ratwani, and Nazli Goharian, Identifying Harm Events in Clinical Care through Medical Narratives, ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2017)
Arman Cohan, Allan Fong, Nazli Goharian, and Raj Ratwani, A Neural Attention Model for Categorizing Patient Safety Events, European Conference on Information Retrieval (ECIR 2017)
Arman Cohan, Luca Soldaini, Nazli Goharian, Allan Fong, Ross Filice, Raj Ratwani Identifying Significance of Discrepancies in Radiology Reports, Workshop on data Mining for Medicine and Healthcare (DMMH) at SDM 2016

Information Extraction

Ziling Fan, Arman Cohan, Luca Soldaini, Nazli Goharian, “Relation Extraction for Protein-protein Interactions Affected by Mutations”, The 9th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB), Aug 2018.
S. MacAvaney, L. Soldaini, A. Cohan, and N. Goharian, "Tree-LSTMs for Scientific Relation Classification", International Workshop on Semantic Evaluation (SemEval 2018).
S. MacAvaney, A. Cohan, and N. Goharian, "A Framework for Cross-Domain Clinical Temporal Information Extraction", International Workshop on Semantic Evaluation (SemEval 2017).
A. Cohan, K. Meurer, and N. Goharian, "Temporal Information Processing for Clinical Narratives", International Workshop on Semantic Evaluation (SemEval 2016).
L. Ma, N. Goharian, A. Chowdhury, M. Chung, "Extracting Unstructured Data From Template Generated Web Documents", ACM 12th Conference on Information and Knowledge Management (CIKM), November 2003.

Opinion Mining and Sentiment Analysis of User Reviews; Joint collaboration with Orbitz Worldwide

User reviews are commonly used by both the Web users and the providers of goods and services on the Web. Thus, analyzing and understanding users. reviews plays a pivotal role for the decision making process of both parties. As the popularity of online user reviews continues to increase, it is becoming increasingly difficult for potential customers and even business owners to understand what aspects business reviewers cared about and how the reviewers felt about those aspects. Many websites allow and even encourage people to submit reviews of various products and services. The text within these reviews often contains valuable information not found in a single 1-5 "star rating". This research proposed and evaluated a novel approach to efficiently model and analyze the text within user reviews to estimate how much reviewers care about different aspects of a product (i.e., amenities, food, location, room, etc. of a hotel) by estimating the aspects' weights. A vector of aspect weights synthesizes the average customer's preferences and expectations as well as the product's actual performance, thus providing a way to characterize the subject of the reviews. This approach performs statistically similar to, and arguably better than, the best existing method, but with significantly lower computational complexity (linear time). While the current domain of this research is a hotel review data set, this method is not domain-specific and should work for other types of reviews.

A. Yates, N. Goharian, W. Yee, . Semi-supervised Sentiment Analysis: Merging Labeled Sentences with Unlabeled Reviews to Identify Sentiment., American Society for Information Science and Technology (ASIST), Nov. 2013.
J. Parker, A. Yates, N. Goharian, W.-G. Yee, Efficient Estimation of Aspect Weights., In proceedings of ACM 35th Conference on Research and Development in Information Retrieval (SIGIR), August 2012.

Query Session Analysis

We developed and evaluated our approach that utilized our earlier research on identifying the relationships among topics, now to understand the topic of user queries and intent given sequence of user queries from a session or sessions. The context of the session queries is utilized to improve the effectiveness of identifying the intent or topic of current query. Earlier efforts utilized fixed number of preceding queries to derive such contextual information. We proposed and evaluated an approach (DQW) that identifies a set of "unambiguous" preceding queries in a dynamically determined window to utilize in classifying an ambiguous query to a topic. Furthermore, utilizing a relationship-net (R-net) that represents relationships among known topics, we improved the classification effectiveness for those ambiguous queries whose predicted topic in this relationship-net is related to the topic of a query within the window. Our results indicated that the hybrid approach (DQW+R-net) statistically significantly improves the Conditional Random Field (CRF) query classification approach when static query windowing and hierarchical taxonomy are used (SQW+Tax), in terms of precision (10.8%), recall (13.2%), and F1 measure (11.9%). The findings of this research can improve our understanding of user query intent and consequently the search results.

N. Goharian, S. Mengle, Context Aware Query Classification Using Dynamic Query Window and Relationship Net, In proceedings of ACM 33rd Conference on Research and Development in Information Retrieval (SIGIR), July 2010.
D. Guan, H. Yang, N. Goharian, Effective Structured Query Formulation for Session Search, TREC 2012. Query Session Track.

Finding Category / Topic Relationships

The hierarchical nature of existing Web directories, ontologies, and folksonomies, are known to provide meaningful information that guide users and applications. Knowledge of relationships among text categories is of interest in different domains such as text classification, content analysis, text mining, query [session] understanding. Knowledge of relationships among categories is of the interest in different domains such as text classification, content analysis, and text mining. We propose and evaluate approaches to effectively identify relationships among document categories. Our proposed novel method capitalizes on the misclassification results of a text classifier to identify potential relationships among categories. This leads to a relationship network. We demonstrate that our system detects such relationships, even those relationships that assessors failed to identify in manual evaluation. Furthermore, we favorably compare the effectiveness of our methods with the state of art method and demonstrate a significant improvement in precision and recall. Furthermore, we are interested to discover interesting relationships in the existing hierarchical knowledge representations. The hierarchical nature of existing Web directories, ontologies, and folksonomies, are known to provide meaningful information that guide users and applications. We hypothesize that such hierarchical structures provide richer information if they are further enriched by incorporating additional links besides parents, and siblings, namely, between non-sibling nodes. We call such structure a "networked hierarchy". Our empirical results indicate that such a networked hierarchy introduces interesting links between nodes (non-sibling) that otherwise in a hierarchical structure are not evident. This research findings can be utilized to improve and maintain the existing hierarchies, construct topic hierarchies or networks, and improve our understanding of topic hierarchies in text search and query session research.

N. Goharian, S. Mengle, Networked Hierarchies for Web Directories, 20th International World Wide Web conference (WWW), March 2011
S. Mengle and N. Goharian, Detecting Relationships among Categories using Text Classification., Journal of American Society for Information Science and Technology (JASIST), 61 (5), May 2010.
S. Mengle, N. Goharian, A. Platt, Discovering Relationships among Categories using Misclassification Information., ACM 23rd Symposium on Applied Computing (SAC), March 2008.

Passage Detection

Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organization. We explore the methodology to detect such hidden passages within a document. A document is divided into passages using various document splitting techniques, and a text classifier is used to categorize such passages. We present a novel document splitting technique called dynamic windowing, which significantly improves precision, recall and F1 measure.

S. Mengle and N. Goharian, "Passage Detection Using Text Classification", Journal of American Society for Information Science and Technologyi (JASIST), 60(4) March 2009.
N. Goharian, S. Mengle, On Document Splitting in Passage Detection. In proceedings of 31st Conference on Research and Development in Information Retrieval (ACM SIGIR 2008), July 2008.
S. Mengle, N. Goharian, Detecting Hidden Passages from Documents. In proceedings of SIAM Conference on Data Mining (SDM 2008) Workshop, April 2008.

Feature Selection in Text Classification

With the ever-increasing number of documents on the web, digital libraries, news sources, etc., the need of a text classifier that can classify massive amount of data is becoming more critical and difficult. The major problem in text classification is the high dimensionality of feature space. The Support Vector Machine (SVM) classifier is shown to perform consistently better than other text classification algorithms. However, the time taken for training a SVM model is more than other algorithms. We explore the use of the Ambiguity Measure (AM) feature selection method that uses only the most unambiguous keywords to predict the category of a document. Our analysis shows that AM reduces the training time by more than 50% than the scenario when no feature selection is used, while maintaining the accuracy of the text classifier equivalent to or better than using the whole feature set. We empirically show the effectiveness of our approach in outperforming seven different feature selection methods using two standard benchmark datasets

S. Mengle and N. Goharian, "Ambiguity Measure Feature Selection Algorithm", Journal of American Society for Information Science and Technology (JASIST), 60(5), April 2009.
S. Mengle, N. Goharian, Using Ambiguity Measure Feature Selection Algorithm for Support Vector Machine Classifier, ACM 23rd Symposium on Applied Computing (SAC), March 2008.
S. Mengle, N. Goharian, A. Platt, "FACT: Fast Algorithm for Categorizing Text" IEEE 5th International Conference on Intelligence and Security Informatics(ISI), May 2007.

Misuse (Off-Topic Search) Detection in Information Retrieval

Most computer crime traditionally has been the "insider" problem. In fact after virus, i.e., malicious code, insider abuse, called misuse, is the second most threatening attack. We focus the problem on misuse of search systems. Misuse detection is an attack to the system by an authorized user who is misusing their privileges. Prior work on misuse detection mainly focused on using logs and user profiles. Profile-based detection systems audit the deviation of user activities from normal user profiles. A user's command history is reviewed based on the percentage of commands used over a specific period of time and logs are mined. We developed algorithms and implemented a misuse detection system by comparing user behavior to user interest profile learned through clustering, relevance feedback, and finally fusion of results of these methods. We evaluated our system by setting up both an automatic and manual (four human evaluators) evaluation systems and showed a significant improvement in detection rate.

A. Platt, S. Mengle, N. Goharian, Improving Classification Based Off-topic Search Detection via Category Relationships, ACM 24th Symposium on Applied Computing (SAC), March 2009.
N. Goharian and A. Platt, DOTS: Detection of Off-Topic Search Via Result Clustering. IEEE 5th International Conference on Intelligence and Security Informatics (ISI), May 2007.
A. Platt, N. Goharian, S. Mengle, Using User Query Sequence to Detect Off-Topic Search., ACM 22nd Symposium on Applied Computing (SAC), March 2007.
N. Goharian, L. Ma, Off-Topic Access Detection In Information Systems, ACM 14th Conference on Information and Knowledge Management (CIKM), November 2005.
N. Goharian, L. Ma, C. Meyer, .Detecting Misuse of Information Retrieval Systems Using Data Mining Techniques. IEEE International Conference on Intelligence and Security Informatics. May 2005.
N. Goharian, L. Ma, "Query Length Impact on Misuse Detection in Information Retrieval Systems", ACM 20th Symposium on Applied Computing (SAC), March 2005.
L. Ma and N. Goharian, "Using Relevance Feedback to Detect Misuse in Information Retrieval Systems" ACM 13th Conference on Information and Knowledge Management (CIKM), November 2004.
R. Cathey, L. Ma, N. Goharian, D. Grossman, "Misuse Detection for Information Retrieval Systems", ACM 12th Conference on Information and Knowledge Management (CIKM), November 2003.

Medical Informatics

One of the areas in Medical Informatics concerns searching biomedical literature, which differs from conventional search in that the vocabulary (terms) involve significantly different grammatical structures. Suffixes vary in nature; synonyms are far more common; the reliance of taxonomies is far greater. This research addresses issues raised by this domain. Another area is Data collection for clinical research, which is quite fragmented in the field of medicine. Issues of administrative hurdles associated with multi-center studies, error free data collection, automated analysis, and increased collaboration among different medical research centers is of concern. The database and data mining techniques allow for error free and automated analysis. In collaboration with the Northwestern Medical School, we designed and developed a computer-assisted medical application that captured data needed to study the effectiveness of the diagnosis and treatment of Urinary Tract Infections (UTIs). Furthermore, we develpped a data collection and analysis system, via which the application of LithoTron® lithotripter on the patients with kidney stone were analyzed. Yet another area of my recent interest in Medical Informatics relates to mining social media for finding patterns and opinions of patients on various treatments of specific dieseases.

J. Urbain, O. Frieder, N. Goharian, Passage relevance models for genomics search., (Journal of) BMC Bioinformatics, 10(Suppl 3):S3 (19 March 2009).
J. Urbain, O. Frieder, and N. Goharian, "A Dimensional Retrieval Model for Integrating Semantics and Statistical Evidence in Context for Genomics Literature Search, "(Journal of) Computers in Biology and Medicine 39 (1), January 2009.
J. Urbain, N. Goharian, and O. Frieder, "Probabilistic Passage Models for Semantic Search of Genomics Literature," Journal of the American Society of Information Science and Technology, 59 (12), September 2008.
J. Urbain, N. Goharian, O. Frieder, IIT TREC-2007 Genomics Track: Using Concept-based Semantics in Context for Genomics Literature Passage Retrieval., Proceedings of the Fourteenth Text REtrieval Conference,November 2007.
J. Urbain, N. Goharian, O. Frieder, IIT TREC-2006: Genomics Track, Proceedings of the Fifth Text REtrieval Conference, November 2006.
J. Urbain, N. Goharian, O. Frieder, IIT TREC-2005: Genomics Track, Proceedings of the Fourteenth Text REtrieval Conference, November 2005.
P. Jain, N. Goharian, A. Weiser, S. Kimm, S. Kim, J. Stern, J. Pazona, C. Wambi, R. Yap, L. Blunt, and R. Nadler, Efficiency and Safety of the Healthtronics LithoTron® Lithotripter., Journal of EndoUrology, 18 (1), January/February 2004.
N. Goharian, P. Jain, G. Kora, A. Jain .A Web-Based Medical Diagnosis and Treatment System for Urinary Tract Infections., The 2002 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS'02) June 2002.