GUCL: Computational Linguistics @ Georgetown
We are a group of Georgetown University faculty, student, and staff researchers at the intersection of language and computation. Our areas of expertise include natural language processing, corpus linguistics, information retrieval, text mining, and more. Members belong to the Linguistics and/or Computer Science departments.
GUCL holds monthly group meetings about research and maintains a mailing list for its members. (Contact Nathan Schneider to subscribe.) This website will also promote courses, talks, and other events on campus that relate to topics in computational linguistics.
- Grad CS research presentations: 9/15 11-1:15, 9/22 10-12:15, 10/5 11-1:15, all in STM 326
- Cristian Danescu-Niculescu-Mizil (Cornell): Linguistics 10/13/17
- Tim Finin (UMBC): CS 10/27/17
- Matthew Marge (ARL): CS 11/3/17
- Anna Feldman (Montclair State University): Linguistics 11/10/17
- Ben Carterette (Delaware): CS 11/10/17
- Ben Van Durme (JHU): CS 11/17/17
- Laura Dietz (UNH): CS 11/14/17
- Jordan Boyd-Graber (UMD): CS 12/1/17, 1:00
Hal Daumé (UMD)
CS Colloquium 10/14/16, 11:00 in St. Mary’s 326
Learning Language through Interaction
Machine learning-based natural language processing systems are amazingly effective, when plentiful labeled training data exists for the task/domain of interest. Unfortunately, for broad coverage (both in task and domain) language understanding, we're unlikely to ever have sufficient labeled data, and systems must find some other way to learn. I'll describe a novel algorithm for learning from interactions, and several problems of interest, most notably machine simultaneous interpretation (translation while someone is still speaking).
This is all joint work with some amazing (former) students He He, Alvin Grissom II, John Morgan, Mohit Iyyer, Sudha Rao and Leonardo Claudino, as well as colleagues Jordan Boyd-Graber, Kai-Wei Chang, John Langford, Akshay Krishnamurthy, Alekh Agarwal, Stéphane Ross, Alina Beygelzimer and Paul Mineiro.
Hal Daumé III is an associate professor in Computer Science at the University of Maryland, College Park. He holds joint appointments in UMIACS and Linguistics. He was previously an assistant professor in the School of Computing at the University of Utah. His primary research interest is in developing new learning algorithms for prototypical problems that arise in the context of language processing and artificial intelligence. This includes topics like structured prediction, domain adaptation and unsupervised learning; as well as multilingual modeling and affect analysis. He associates himself most with conferences like ACL, ICML, NIPS and EMNLP. He earned his PhD at the University of Southern California with a thesis on structured prediction for language (his advisor was Daniel Marcu). He spent the summer of 2003 working with Eric Brill in the machine learning and applied statistics group at Microsoft Research. Prior to that, he studied math (mostly logic) at Carnegie Mellon University. He still likes math and doesn't like to use C (instead he uses O'Caml or Haskell).
Yulia Tsvetkov (CMU/Stanford)
Linguistics Speaker Series 11/11/16, 3:30 in Poulton 230
On the Synergy of Linguistics and Machine Learning in Natural Language Processing
One way to provide deeper insight into data and to build more powerful, robust models is bridging between linguistic knowledge and statistical learning. I’ll present model-based approaches that incorporate linguistic knowledge in novel ways.
First, I’ll show how linguistic knowledge comes to the rescue in processing languages which lack large data resources. I’ll describe a new approach to cross-lingual knowledge transfer that models the historical process of lexical borrowing between languages, and I will show how its predictions can be used to improve statistical machine translation systems.
In the second part of my talk, I’ll argue that linguistic insight helps improve learning also in resource-rich conditions. I’ll present three methods to integrate linguistic knowledge in training data, neural network architectures, and into evaluation of word representations. The first method uses features quantifying linguistic coherence, prototypicality, simplicity, and diversity to find a better curriculum for learning distributed representations of words. Distributed representations of words capture which words have similar meanings and occur in similar contexts. With improved word representations, we improve part-of-speech tagging, parsing, named entity recognition, and sentiment analysis. The second describes polyglot language models, neural network architectures trained to predict symbol sequences in many different languages using shared representations of symbols and conditioning on typological information about the language to be predicted. Finally, the third is an intrinsic evaluation measure of the quality of distributed representations of words. It is based on correlations of learned vectors with features extracted from manually crafted lexical resources. This computationally inexpensive method obtains strong correlation with performance of the vectors in a battery of downstream semantic and syntactic evaluation tasks. I’ll conclude with future research questions.
Yulia Tsvetkov is a postdoc in the Stanford NLP Group, where she works on computational social science with professor Dan Jurafsky. During her PhD in the Language Technologies Institute at Carnegie Mellon University, she worked on advancing machine learning techniques to tackle cross-lingual and cross-domain problems in natural language processing, focusing on computational phonology and morphology, distributional and lexical semantics, and statistical machine translation of both text and speech. In 2017, Yulia will join the Language Technologies Institute at CMU as an assistant professor.
Marine Carpuat (UMD)
Linguistics Speaker Series 11/18/16, 3:30 in Poulton 230
Toward Natural Language Inference Across Languages
Natural Language processing tasks as diverse as automatically extracting information from text, answering questions, translating or summarizing documents, all require the ability to compare and contrast the meaning of words and sentences. State-of-the-art techniques rely on dense vector representations which capture the distributional properties of words in large amounts of text in a single language. We seek to improve these representations to capture not only similarity in meaning between words or sentences, but also inference relations such as entailment and contradiction, and enable comparisons not only within, but also across languages.
In this talk, we will present novel approaches to inducing word representations from multilingual text corpora. First, we will show that translations in e.g. Chinese can be used as distant supervision to induce English word representations that can be composed into better representations of English sentences (Elgohary and Carpuat, ACL 2016). Then we will show how sparsity constraints can further improve word representations, and enable the detection not only semantic similarity (do "cure" and "remedy" have the same meaning?), but also entailment (does "antidote" entail "cure"?) between words in different languages (Vyas and Carpuat, NAACL 2016).
Marine Carpuat is an Assistant Professor in Computer Science at the University of Maryland, with a joint appointment at UMIACS. Her research interests are in natural language processing, with a focus on multilinguality. Marine was previously a Research Scientist at the National Research Council of Canada, and a postdoctoral researcher at the Columbia University Center for Computational Learning Systems. She received a PhD in Computer Science from the Hong Kong University of Science & Technology (HKUST) in 2008. She also earned a MPhil in Electrical Engineering from HKUST and an engineering degree from the French Grande Ecole Supélec.
Shomir Wilson (UC)
CS Colloquium 11/21/16, 11:00 in St. Mary’s 326
Text Analysis to Support the Privacy of Internet Users
Shomir Wilson is an Assistant Professor of Computer Science in the Department of Electrical Engineering and Computing Systems at the University of Cincinnati. His professional interests span pure and applied research in natural language processing, privacy, and artificial intelligence. Previously he held postdoctoral and lecturer positions in Carnegie Mellon University's School of Computer Science, and he spent a year as an NSF International Research Fellow in the University of Edinburgh's School of Informatics. He received his Ph.D. in Computer Science from the University of Maryland in 2011.
Mark Dredze (JHU)
CS Colloquium 11/29/16, 11:00 in St. Mary’s 326
Topic Models for Identifying Public Health Trends
Twitter and other social media sites contain a wealth of information about populations and has been used to track sentiment towards products, measure political attitudes, and study social linguistics. In this talk, we investigate the potential for Twitter and social media to impact public health research. Broadly, we explore a range of applications for which social media may hold relevant data. To uncover these trends, we develop new topic models that can reveal trends and patterns of interest to public health from vast quantities of data.
Mark Dredze is an Assistant Research Professor in Computer Science at Johns Hopkins University and a research scientist at the Human Language Technology Center of Excellence. He is also affiliated with the Center for Language and Speech Processing and the Center for Population Health Information Technology. His research in natural language processing and machine learning has focused on graphical models, semi-supervised learning, information extraction, large-scale learning, and speech processing. His focuses on public health informatics applications, including information extraction from social media, biomedical and clinical texts. He obtained his PhD from the University of Pennsylvania in 2009.
Mona Diab (GW)
CS Colloquium 12/2/16, 2:30 in St. Mary’s 414
Processing Arabic Social Media: Challenges and Opportunities
We recently witnessed an exponential growth in Arabic social media usage. Processing such media is of great utility for all kinds of applications ranging from information extraction to social media analytics for political and commercial purposes to building decision support systems. Compared to other languages, Arabic, especially the informal variety, poses a significant challenge to natural language processing algorithms since it comprises multiple dialects, linguistic code switching, and a lack of standardized orthographies, to top its relatively complex morphology. Inherently, the problem of processing Arabic in the context of social media is the problem of how to handle resource poor languages. In this talk I will go over some of our insights to some of these problems and show how there is a silver lining where we can generalize some of our solutions to other low resource language contexts.
Mona Diab is an Associate Professor in the Department of Computer Science, George Washington University (GW). She is the founder and Director of the GW NLP lab (CARE4Lang). Before joining GW, she was a Research Scientist (Principal Investigator) at the Center for Computational Learning Systems (CCLS), Columbia University in New York. She is also co-founder of the CADIM group with Nizar Habash and Owen Rambow, which is one of the leading places and reference points on computational processing of Arabic and its dialects. Her research interests span several areas in computational linguistics/natural language processing: computational lexical semantics, multilingual processing, social media processing, information extraction & text analytics, machine translation, and computational socio-pragmatics. She has a special interest in low resource language processing with a focus on Arabic dialects.
Joel Tetreault (Grammarly)
CS Colloquium 1/27/17, 11:00 in St. Mary’s 326
Analyzing Formality in Online Communication
Full natural language understanding requires comprehending not only the content or meaning of a piece of text or speech, but also the stylistic way in which it is conveyed. To enable real advancements in dialog systems, information extraction, and human-computer interaction, computers need to understand the entirety of what humans say, both the literal and the non-literal. This talk presents an in-depth investigation of one particular stylistic aspect, formality. First, we provide an analysis of humans' subjective perceptions of formality in four different genres of online communication. We highlight areas of high and low agreement and extract patterns that consistently differentiate formal from informal text. Next, we develop a statistical model for predicting formality at the sentence level, using rich NLP and deep learning features, and then evaluate the model's performance against human judgments across genres. Finally, we apply our model to analyze language use in online debate forums. Our results provide new evidence in support of theories of linguistic coordination, underlining the importance of formality for language generation systems.
This work was done with Ellie Pavlick (UPenn) during her summer internship at Yahoo Labs.
Joel Tetreault is Director of Research at Grammarly. His research focus is Natural Language Processing with specific interests in anaphora, dialogue and discourse processing, machine learning, and applying these techniques to the analysis of English language learning, automated essay scoring among others. Currently he works on the research and development of NLP tools and components for the next generation of intelligent writing assistance systems. Prior to joining Grammarly, he was a Senior Research Scientist at Yahoo Labs, Senior Principal Manager of the Core Natural Language group at Nuance Communications, Inc., and worked at Educational Testing Service for six years as a managing research scientist where he researched automated methods for essay scoring, detecting grammatical errors by non-native speakers, plagiarism detection, and content scoring. Tetreault received his B.A. in Computer Science from Harvard University and his M.S. and Ph.D. in Computer Science from the University of Rochester. He was also a postdoctoral research scientist at the University of Pittsburgh's Learning Research and Development Center, where he worked on developing spoken dialogue tutoring systems. In addition, he has co-organized the Building Educational Application workshop series for 8 years, several shared tasks, and is currently NAACL Treasurer.
Kenneth Heafield (Edinburgh)
CS Colloquium 2/2/17, 11:00 in St. Mary’s 326
Machine Translation is Too Slow
We're trying to make machine translation output less terrible, but we're impatient. A neural translation system took two weeks to train in 1996 and two weeks to train in 2016 because the field used twenty years of computing advances to build bigger and better models subject to the same patience limit. I'll talk about multiple efforts to make things faster: coarse-to-fine search algorithms and sparse gradient updates to reduce network communication.
Kenneth Heafield is a Lecturer (~Assistant Professor) in computer science at the University of Edinburgh. Motivated by machine translation problems, he takes a systems-heavy approach to improving quality and speed of neural systems. He is the creator of the widely-used KenLM library for efficient language modeling.
Margaret Mitchell (Google Research)
CS Colloquium 2/16/17, 11:00 in St. Mary’s 326
Algorithmic Bias in Artificial Intelligence: The Seen and Unseen Factors Influencing Machine Perception of Images and Language
The success of machine learning has recently surged, with similar algorithmic approaches effectively solving a variety of human-defined tasks. Tasks testing how well machines can perceive images and communicate about them have exposed strong effects of different types of bias, such as selection bias and dataset bias. In this talk, I will unpack some of these biases, and how they affect machine perception today. I will introduce and detail the first computational model to leverage human Reporting Bias—what people mention—in order to learn ground-truth facts about the visual world.
I am a Senior Research Scientist in Google's Research & Machine Intelligence group, working on advancing artificial intelligence towards positive goals, as well as ethics in AI and demographic diversity of researchers. My research is on vision-language and grounded language generation, focusing on how to help computers communicate based on what they can process. My work combines computer vision, natural language processing, social media, many statistical methods, and insights from cognitive science. Before Google, I was a founding member of Microsoft Research's "Cognition" group, focused on advancing vision-language artificial intelligence. Before MSR, I was a postdoctoral researcher at The Johns Hopkins University Center of Excellence, where I mainly focused on semantic role labeling and sentiment analysis using graphical models, working under Benjamin Van Durme. Before that, I was a postgraduate (PhD) student in the natural language generation (NLG) group at the University of Aberdeen, where I focused on how to naturally refer to visible, everyday objects. I primarily worked with Kees van Deemter and Ehud Reiter. I spent a good chunk of 2008 getting a Master's in Computational Linguistics at the University of Washington, studying under Emily Bender and Fei Xia. Simultaneously (2005 - 2012), I worked on and off at the Center for Spoken Language Understanding, part of OHSU, in Portland, Oregon. My title changed with time (research assistant/associate/visiting scholar), but throughout, I worked on technology that leverages syntactic and phonetic characteristics to aid those with neurological disorders under Brian Roark. I continue to balance my time between language generation, applications for clinical domains, and core AI research.
Glen Coppersmith (Qntfy & JHU)
CS Colloquium 2/24/17, 11:00 in St. Mary’s 326
Quantifying the White Space
Behavioral assessment and measurement today are typically invasive and human intensive (for both patient and clinician). Moreover, by their nature, they focus on retrospective analysis by the patient (or the patient’s loved ones) about emotionally charged situations—a process rife with biases, not repeatable, and expensive. We examine all the data in the “white space” between interactions with the healthcare system (social media data, wearables, activities, nutrition, mood, etc.), and have shown quantified signals relevant to mental health that can be extracted from them. These methods to gather and analyze disparate data unobtrusively and in real time enable a range of new scientific questions, diagnostic capabilities, assessment of novel treatments, and quantified key performance measures for behavioral health. These techniques hold special promise for suicide risk, given the dearth of unbiased accounts of a person’s behavior leading up to a suicide attempt. We are beginning to see the promise of using these disparate data for revolution in mental health.
Glen is the founder and CEO of Qntfy (pronounced “quantify”), a company devoted to scaling therapeutic impact by empowering mental health clinicians and patients with data science and technology. Qntfy brings a deep understanding of the underlying technology and an appreciation for the human processes these technologies need to fit in to in order to make an impact. Qntfy, in addition to providing analytic and software solutions, considers it a core mission to push the fundamental and applied research at the intersection of mental health and technology. Qntfy built the data donation site OurDataHelps.org to gather and curate the datasets needed to drive mental health research, working closely with the suicide prevention community. Qntfy was also 2015 Foundry Cup grand prize winner – a design competition seeking innovative approaches to diagnosing and treating PTSD.
Prior to starting Qntfy, Glen was the first full-time research scientist at the Human Language Technology Center of Excellence at Johns Hopkins University where he joined in 2008. His research has focused on the creation and application of statistical pattern recognition techniques to large disparate data sets for addressing challenges of national importance. Oftentimes, the data of interest was human language content and associated metadata. Glen has shown particular acumen for enabling inference tasks that bring together diverse and noisy data. His work spans from principled exploratory data analysis, anomaly detection, graph theory, statistical inference and visualization.
Glen earned his Bachelors in Computer Science and Cognitive Psychology in 2003, a Masters in Psycholinguistics in 2005, and his Doctorate in Neuroscience in 2008, all from Northeastern University. As this suggests, his interests and knowledge are broad, from computer science and statistics to biology and psychology.
Jeniya Tabassum (OSU)
GUCL 4/6/17, 2:00 in St. Mary’s 326
Large Scale Learning for Temporal Expressions
Temporal expressions are words or phrases that refer to dates, times or durations. Social media especially contains time-sensitive information about various events and requires accurate temporal analysis. In this talk, I will present our work on TweeTIME, a minimally supervised time resolver that learns from large quantities of unlabeled data and does not require any hand-engineered rules or hand-annotated training corpora. This is the first successful application of distant supervision for end-to-end temporal recognition and normalization. Our proposed system outperforms all previous supervised and rule-based systems in the social media domain. I will also present ongoing work applying deep learning methods for resolving time expressions and discuss opportunities and challenges that a deep learning system faces when extracting time sensitive information from text.
Jeniya Tabassum is a third year PhD student in the Department of CSE at the Ohio Sate University, advised by Prof Alan Ritter. Her research focuses on developing machine learning techniques that can effectively extract relevant and meaningful information from social media data. Prior to OSU, she received a B.S. in Computer Science and Engineering from Bangladesh University of Engineering and Technology.
Jacob Eisenstein (GA Tech)
Linguistics Speaker Series 4/21/17, 3:30 in Poulton 230
Social Networks, Social Meaning
Language is socially situated: both what we say and what we mean depend on our identities, our interlocutors, and the communicative setting. The first generation of research in computational sociolinguistics focused on large-scale social categories, such as gender. However, many of the most socially salient distinctions are locally defined. Rather than attempt to annotate these social properties or extract them from metadata, we turn to social network analysis, which has been only lightly explored in traditional sociolinguistics. I will describe three projects at the intersection of language and social networks. First, I will show how unsupervised learning over social network labelings and text enables the induction of social meanings for address terms, such as “Ms” and “dude”. Next, I will describe recent research that uses social network embeddings to induce personalized natural language processing systems for individual authors, improving performance on sentiment analysis and entity linking even for authors for whom no labeled data is available. Finally, I will describe how the spread of linguistic innovations can serve as evidence for sociocultural influence, using a parametric Hawkes process to model the features that make dyads especially likely or unlikely to be conduits for language change.
Jacob Eisenstein is an Assistant Professor in the School of Interactive Computing at Georgia Tech. He works on statistical natural language processing, focusing on computational sociolinguistics, social media analysis, discourse, and machine learning. He is a recipient of the NSF CAREER Award, a member of the Air Force Office of Scientific Research (AFOSR) Young Investigator Program, and was a SICSA Distinguished Visiting Fellow at the University of Edinburgh. His work has also been supported by the National Institutes for Health, the National Endowment for the Humanities, and Google. Jacob was a Postdoctoral researcher at Carnegie Mellon and the University of Illinois. He completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award. Jacob's research has been featured in the New York Times, National Public Radio, and the BBC. Thanks to his brief appearance in If These Knishes Could Talk, Jacob has a Bacon number of 2.
Christo Kirov (JHU)
GUCL 4/28/17, 2:00 in St. Mary’s 250
Rich Morphological Modeling for Multi-lingual HLT Applications
In this talk, I will discuss a number of projects aimed at improving HLT applications across a broad range of typologically diverse languages by modeling morphological structure. These include the creation of a very large, normalized morphological paradigm database derived from Wiktionary, consensus-based morphology transfer via cross-lingual projection, and approaches to lemmatization and morphological analysis and generation based on recurrent neural network architectures. Much of this work falls under the umbrella of the UniMorph project at CLSP, led by David Yarowsky and supported by DARPA LORELEI, and was developed in close collaboration with John Sylak-Glassman.
Dr. Christo Kirov is a Postdoctoral Research Fellow at the Center for Language and Speech Processing at JHU, working with David Yarowsky. His current research combines novel machine learning approaches with traditional linguistics to represent and learn morphological systems across the world’s languages, and to leverage this level of language structure in Machine Translation, Information Extraction, and other HLT tasks. Prior to joining CLSP, he was a Visiting Professor at the Georgetown University Linguistics Department. He has received his PhD in Cognitive Science from Johns Hopkins University studying under Colin Wilson, with dissertation work focusing on Bayesian approaches to phonology and phonetic expression.
Bill Croft (UNM)
Linguistics 5/18/17, 1:00 in Poulton 230
Linguistic Typology Meets Universal Dependencies
Current work on universal dependency schemes in NLP does not make reference to the extensive typological research on language universals, but could benefit since many principles are shared between the two enterprises. We propose a revision of the syntactic dependencies in the Universal Dependencies scheme (Nivre et al. 2015, 2016) based on four principles derived from contemporary typological theory: dependencies should be based primarily on universal construction types over language-specific strategies; syntactic dependency labels should match lexical feature names for the same function; dependencies should be based on the information packaging function of constructions, not lexical semantic types; and dependencies should keep distinct the “ranks” of the functional dependency tree.
William Croft received his Ph.D. in 1986 at Stanford University under Joseph Greenberg. He has taught at the Universities of Michigan, Manchester (UK) and New Mexico, and has been a visiting scholar at the Max Planck Institutes of Psycholinguistics and Evolutionary Anthropology, and at the Center for Advanced Study in the Behavioral Sciences. He has written several books, including Typology and Universals, Explaining Language Change, Radical Construction Grammar, Cognitive Linguistics [with D. Alan Cruse] and Verbs: Aspect and Causal Structure. His primary research areas are typology, semantics, construction grammar and language change. He has argued that grammatical structure can only be understood in terms of the variety of constructions used to express functions across languages; that both qualitative and quantitative methods are necessary for grammatical analysis; and that the study of language structure must be situated in the dynamics of evolving conventions of language use in social interaction.
Spencer Whitehead (RPI)
GUCL 8/15/17, 11:00 in St. Mary’s 326
Multimedia Integration: Event Extraction and Beyond
Multimedia research is becoming increasingly important, as we are immersed in an ever-growing ocean of noisy, unstructured data of various modalities, such as text and images. A major thrust of multimedia research is to leverage multimodal data to better extract information, including the use of visual information to post-process or re-rank natural language processing results, or vice versa. In our work, we seek to tightly integrate multimodal information into a flexible, unified approach that jointly utilizes text and images. Here we focus on one application: improving event extraction by incorporating visual knowledge with words and phrases from text documents. Such visual knowledge provides a means to overcome the challenges that the ambiguities of language introduce. We first discover named visual patterns in a weakly-supervised manner in order to avoid the requirement of parallel/well-aligned annotations. Then, we propose a multimodal event extraction algorithm where the event extractor is jointly trained with textual features and visual patterns. We find improvements of 7.1% and 8.5% absolute F-score gain on event trigger and argument labeling, respectively. Moving forward, we intend to extend the idea of tight integration of multimodal information to other tasks, namely image and video captioning.
Spencer Whitehead is a PhD student in the Computer Science Department at Rensselaer Polytechnic Institute, where he is advised by Dr. Heng Ji. His interests broadly span Natural Language Processing, Machine Learning, and Computer Vision, but mainly lie in the intersection of these fields: multimedia information extraction and natural language generation from multimedia data. A primary goal of his work is to develop intelligent systems that can utilize structured, unstructured, and multimodal data to extract information as well as generate coherent, accurate, and focused text. Central to his research is the creation of novel architectures, deep learning or otherwise, which can properly incorporate such heterogeneous data. He received his Bachelors of Science degree in Mathematics and Computer Science from Rensselaer Polytechnic Institute with highest honors.