Language comes naturally to humans, but computers need to be able to unpack linguistic grammar and meaning to communicate intelligently.
Because we cannot directly see what abstractions humans rely on when processing language, there is a double mystery: which abstractions are necessary to describe linguistic communication, and how can they be imparted to computers at a scale commensurate with the range and richness of a natural language such as English?
My research addresses these questions through the design, annotation, and automation of broad-coverage, linguistically based representations of meaning.
I enjoy collaborative research, as can be seen from my publications.
You can read an extended research overview or browse the selected topics below.
Broad-coverage semantic representations
Humans possess extensive vocabularies and talk about a wide variety of things, and the relationship between words/sentences and possible concepts/meanings is many-to-many. A central challenge in computational semantics is therefore to augment linguistic content with some representation that lends itself to useful inferences, such as judgments of paraphrase, similarity, or entailment. Ideally, these representations would allow a machine to conclude that The dog chased the cat is far more similar in meaning to The canine pursued the feline than to The cat chased the dog. And the meaning of dog in those sentences should be different from its meaning in I ate a hot dog.
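As a toy illustration (not drawn from my published work), a surface word-overlap measure cannot make these distinctions: it rates the role-reversed sentence as identical to the original while barely connecting the paraphrase, which is exactly the gap that meaning representations aim to fill.

```python
def jaccard(a, b):
    """Word-overlap (Jaccard) similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

s1 = "The dog chased the cat"
s2 = "The canine pursued the feline"   # paraphrase, almost no shared words
s3 = "The cat chased the dog"          # different meaning, same word set

print(jaccard(s1, s2))  # low: only "the" is shared
print(jaccard(s1, s3))  # 1.0: the word sets are identical
```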
My work involves adapting descriptive techniques used in linguistics to make them practical for broad-coverage human annotation and machine learning. Rather than targeting a single meaning representation, I think it is useful to explore multiple levels of abstraction, including the lexical level, the proposition/event level, the sentence level, and the discourse level. The subsections below mention some of the representations I work on and with.
Lexical expressions and their semantic classes
To fully comprehend the sentence
A Junction City chocolate lab gave birth to 14 puppies!
it is necessary to recognize that Junction City is the name of a place, that a chocolate lab in this context is a kind of animal (not a confectionery research facility), and that gave birth denotes a biological event (not literally giving something to someone). In general, this sort of coarse-grained lexical grouping and disambiguation can be represented with manual or automatic annotations of multiword expressions and supersenses, as introduced in my thesis work (Schneider & Smith 2015; Schneider et al. 2016).
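To make this concrete, here is an illustrative sketch (not the exact file format of my corpora) of how multiword expression groupings and supersense labels for this sentence might be encoded as token-level tags in the spirit of BIO encoding; the specific label strings are assumptions for illustration only.

```python
tokens = ["A", "Junction", "City", "chocolate", "lab",
          "gave", "birth", "to", "14", "puppies", "!"]

# Each tag pairs an MWE position (O = outside, B = begins an MWE,
# I = continues one) with an optional supersense label.
tags = [
    ("O", None),             # A
    ("B", "n.LOCATION"),     # Junction  (starts the name "Junction City")
    ("I", None),             # City
    ("B", "n.ANIMAL"),       # chocolate (starts "chocolate lab")
    ("I", None),             # lab
    ("B", "v.body"),         # gave      (starts "gave birth")
    ("I", None),             # birth
    ("O", None),             # to
    ("O", None),             # 14
    ("O", "n.ANIMAL"),       # puppies
    ("O", None),             # !
]

def mwe_spans(tags):
    """Recover multiword expression spans (lists of token indices)."""
    spans, current = [], None
    for i, (pos, _) in enumerate(tags):
        if pos == "B":
            if current:
                spans.append(current)
            current = [i]
        elif pos == "I" and current:
            current.append(i)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans

for span in mwe_spans(tags):
    print(" ".join(tokens[i] for i in span))
# prints: Junction City / chocolate lab / gave birth
```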
Propositions and roles
The above example describes an event of giving birth in which there are two kinds of participants—the individual who gives birth, which in this sentence is the grammatical subject; and the newly born individual(s), which follow the preposition to. There are various ways to formalize these event–participant relationships. I lead one such research effort that focuses on the diverse semantic functions of English prepositions (Schneider et al. 2015, Schneider et al. 2016), and we are now extending the approach to accommodate adpositions and case markers in other languages. I also work on frame-semantic parsing (Das et al. 2010, Das et al. 2014, Kshirsagar et al. 2015), targeting the richer scene descriptions in FrameNet—see, for example, the Giving_birth frame.
The Abstract Meaning Representation (AMR; Banarescu et al. 2013) aims to encode all the semantic concepts and relations of a sentence in a directed graph (notably using PropBank propositions and roles). In LISP-like notation, an AMR graph for the above sentence would be:
(b / bear-02
   :ARG0 (l / lab
            :location (c / city
                         :name (n / name :op1 "Junction" :op2 "City"))
            :mod (c2 / chocolate))
   :ARG1 (p / puppy :quant 14))

To date, AMRs have been created for about 40,000 English sentences. I belong to AMR’s core design team and am interested in using the data to study the syntax–semantics interface, as well as building semantic parsers.
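To make the graph structure explicit, here is a sketch (hypothetical Python, not an official AMR toolkit) of the same graph encoded as concept and relation triples and queried for its core roles.

```python
# The AMR above as (source, relation, target) triples; variables map
# to the concepts they instantiate.
concepts = {
    "b": "bear-02", "l": "lab", "c": "city",
    "n": "name", "c2": "chocolate", "p": "puppy",
}
triples = [
    ("b", ":ARG0", "l"),
    ("l", ":location", "c"),
    ("c", ":name", "n"),
    ("n", ":op1", '"Junction"'),
    ("n", ":op2", '"City"'),
    ("l", ":mod", "c2"),
    ("b", ":ARG1", "p"),
    ("p", ":quant", "14"),
]

def outgoing(var):
    """All (relation, target) edges leaving a variable."""
    return [(rel, tgt) for src, rel, tgt in triples if src == var]

# Who gave birth (ARG0), and to whom (ARG1)?
args = dict(outgoing("b"))
print(concepts[args[":ARG0"]])   # lab
print(concepts[args[":ARG1"]])   # puppy
```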
Syntax
Analyzing the grammatical structure of a sentence is often a prerequisite to understanding its semantic structure. As existing syntactic analyzers tend to be specialized to news and other edited genres, I have worked on efforts to build annotated corpora and taggers/parsers for Twitter messages (Gimpel et al. 2011; Owoputi et al. 2013; Schneider et al. 2013; Kong et al. 2014). In doing so we developed annotation frameworks suited to the kind of data we were working with (tweets) and the kind of annotators we had (computer scientists). The resulting TweetNLP data and software have been widely used. I am also working on annotating transcripts of child-directed speech with Universal Dependencies.
Annotation methodology
For linguistic representations to be considered useful in NLP, they must generalize to real data (beyond toy examples). If multiple human annotators can apply a scheme to corpus data, with reasonably high inter-annotator agreement and reasonably low time/cost, that speaks to the representation’s practical value.
The methodology for developing high-quality annotated corpora is a research area in its own right. For example, informal text genres may require different methods from conventional edited text (Schneider 2015). And annotation need not be seen as fully human-directed: automatic techniques such as inconsistency detection (Hollenstein et al. 2016) could create a feedback loop which improves annotation quality.
Semi-supervised learning (and other paradigms)
Diverse linguistic resources are available for semantics, but they tend to be small, and expanding them manually can be costly. Thus, whereas many NLP tools assume complete supervision, I am interested in incomplete, indirect, or heterogeneous forms of supervision to obtain better models. This includes domain adaptation, exploiting data of a kind that exists in abundance to assist in lesser-resourced settings. In the past I have built NLP systems using latent variable models (Das et al. 2010), self-training (Mohit et al. 2012), cross-lingual projection (Schneider et al. 2013), unsupervised word representations (Owoputi et al. 2013; Qu et al. 2015), and pipelining of models trained on different datasets (Kong et al. 2014; Kshirsagar et al. 2015). Multi-task learning and active learning fit under this umbrella as well.
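As one example of learning from incomplete supervision, the following is a generic self-training loop on toy one-dimensional data: train on the labeled seed, label the unlabeled pool where the model is confident, and retrain. It is a sketch of the paradigm under invented data, not of any of my published systems.

```python
def train(examples):
    """Fit a toy nearest-centroid classifier from (x, label) pairs."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Nearest centroid; confidence = margin between the best two classes."""
    dists = sorted((abs(x - c), y) for y, c in model.items())
    label = dists[0][1]
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return label, margin

def self_train(labeled, unlabeled, threshold=1.0, rounds=3):
    """Iteratively absorb confidently-labeled pool items into training data."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        confident = [(x, predict(model, x)) for x in pool]
        added = [(x, y) for x, (y, m) in confident if m >= threshold]
        if not added:
            break
        labeled.extend(added)
        pool = [x for x in pool if all(x != a for a, _ in added)]
    return train(labeled)

model = self_train(labeled=[(0.0, "A"), (10.0, "B")],
                   unlabeled=[1.0, 2.0, 8.5, 9.0, 5.1])
```

Here the ambiguous point 5.1 is never absorbed, because its margin stays below the confidence threshold; the design choice of a margin cutoff is what keeps self-training from reinforcing its own mistakes.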
Languages other than English
Because people talk about the same things in different languages, meaning representations for different languages should look similar even when the vocabulary and grammar are very different. At the same time, each language imposes nuances of meaning that are not necessarily preserved in translation. I am therefore interested in examining the ways in which linguistic representations need to be adapted to annotate data in multiple languages. Experience with Arabic Wikipedia, for example, suggests that a common set of coarse-grained lexical-semantic classes can be applied across languages, and that lexical-semantic analyzers can in fact exploit machine translation systems (Mohit et al. 2012; Schneider et al. 2012; Schneider et al. 2013). Currently, those of us who developed a broad-coverage annotation scheme for English prepositions are reviewing it in light of adposition/case marker behavior in other languages. I am also active in the Universal Dependencies community, which works to standardize syntactic annotation conventions (to the extent possible) across many languages.