A0: Text processing with Python and Unix

This assignment consists of exercises with the CMU Pronouncing Dictionary, a lexicon of English words with pronunciations. This lexicon can be used for applications such as text-to-speech (speech synthesis).

Download the dictionary from the link on the above website. The main data is in a file called cmudict-0.7b. Here is a sample:

OBTUSE  AA0 B T UW1 S
OBUCHOWSKI  OW0 B Y UW0 K AW1 S K IY0
OBUCHOWSKI(1)  OW0 B UW0 K AW1 S K IY0
OBUCHOWSKI(2)  OW0 B UW0 CH OW1 S K IY0

Each line consists of:

An English word (possibly a name) in all caps
If a word has multiple known pronunciations, these are listed in separate entries. A number in parentheses is added to all but the first to disambiguate the entries.
The pronunciation is given with space-separated ARPAbet symbols. These symbols are explained in the cmudict documentation as well as on the inside back cover of your textbook. Vowels are accompanied by a stress level of 0, 1, or 2: in the OBTUSE entry, UW1 has greater stress than AA0 (that is, the second syllable is stressed).
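To make the entry format concrete, here is a minimal Python sketch that splits one dictionary line into its word, optional entry number, and list of phones. The names ENTRY_RE, parse_entry, and stressed_vowels are hypothetical helpers introduced for illustration; they are not part of cmudict or of the course materials.

```python
import re

# Hypothetical pattern: WORD, an optional "(n)" entry number, whitespace, phones.
ENTRY_RE = re.compile(r"^(?P<word>\S+?)(?:\((?P<num>\d+)\))?\s+(?P<pron>.+)$")

def parse_entry(line):
    """Return (word, entry number or None, list of phones) for one line."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    num = int(m.group("num")) if m.group("num") else None
    return m.group("word"), num, m.group("pron").split()

def stressed_vowels(phones):
    """Return the phones whose stress digit is 1 (primary stress)."""
    return [p for p in phones if p.endswith("1")]

word, num, phones = parse_entry("OBUCHOWSKI(2)  OW0 B UW0 CH OW1 S K IY0")
print(word, num, phones)
print(stressed_vowels(parse_entry("OBTUSE  AA0 B T UW1 S")[2]))  # ['UW1']
```

Only vowels carry a stress digit, so checking the last character of a phone is enough to tell vowels from consonants in this data.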

There are two parts to this assignment. Part I requires you to manipulate the data with simple Unix commands. Part II requires you to write Python code. If you are new to Unix, you can start with Part II, as material relevant to Part I will be covered in lecture.

In general, it is good practice to use version control when developing code (as will be discussed in lecture). If you use a website like GitHub for this purpose, please make sure your code for homework assignments is in a private repository. You can create private repositories under the ENLP2016 group on GitHub; contact the instructor with your GitHub username if you'd like to be added. We still ask that you turn in your answers/code via Canvas.

Part I

These exercises involve writing bash scripts. If you are a Mac OS user, you already have Unix shell access via the Terminal utility. If you are a Windows 10 user, you can follow these instructions to enable bash. Everyone enrolled in the class is also welcome to SSH into the Unix server cs-class.uis.georgetown.edu using their NetID and password. If you're unfamiliar with SSH, see Mark Maloof's guides for Windows and Mac/Unix.

1. Using Unix commands, convert cmudict to the following TSV format: WORD→entry_num_if_present→pronunciation, where → indicates a tab character. Only include entries where the word starts with a letter of the alphabet. The output should match this file: cmudict.tsv. Hint: This can be done with one grep command followed by one sed command. Put your commands in a bash script called a0.1.sh.

2. By piping together a series of Unix commands, list the pronunciations that are associated with more than one word, along with their word counts. (Words with the same pronunciation are called homophones.) Hint: Select only the pronunciation column of the TSV file, count unique entries, and filter out singletons with grep -v. Put your piped commands in a bash script called a0.2.sh.

3. Pipe together the cut and grep commands to list all words that can be polysyllabic. Polysyllabic = multiple syllables in the pronunciation. Put your command in a bash script called a0.3.sh. In a comment in that script, explain what your regex does (there are multiple ways to interpret what counts as a syllable). By the data and definition of syllable you’re using, how many syllables are in the words LIEUTENANT, TUITION, CHOIRS, FAMILY, INTERESTING, NORMALLY?

Check that all three bash scripts that you have created can be run from the command line without error.

Part II

This can be done on any machine with Python 3, including the cs-class server described under Part I. On cs-class, you must run export PATH=/usr/local/anaconda3/bin:$PATH to set up Python 3. (This can go in your ~/.bash_profile so it will run automatically each time you log in.)

4. Write a Python script called tsv2json.py that converts cmudict.tsv to a file cmudict.json where each line is a JSON object: {"word": WORD, "pronunciations": [pron1, pron2, …]}, and each pronunciation is a list (not string!) of phones. The words/pronunciations should be given in the same order as in the input file. However, on each line, it doesn't matter whether the word or the pronunciations come first. Hint: Load the TSV file into appropriate Python objects, then use the json module to write the lines of output.
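The target JSON-lines format can be illustrated with a small sketch. It assumes the TSV's middle column is empty for a word's first entry (one reading of "entry_num_if_present"), and it uses in-memory sample rows rather than the real cmudict.tsv; group_pronunciations is a hypothetical helper name.

```python
import json

# Assumed TSV layout: WORD <tab> entry number (empty for first entry) <tab> phones
tsv_lines = [
    "OBTUSE\t\tAA0 B T UW1 S",
    "OBUCHOWSKI\t\tOW0 B Y UW0 K AW1 S K IY0",
    "OBUCHOWSKI\t1\tOW0 B UW0 K AW1 S K IY0",
]

def group_pronunciations(lines):
    """Map each word to its pronunciations, each a list of phones."""
    prons = {}  # dicts preserve insertion order in Python 3.7+
    for line in lines:
        word, _num, pron = line.rstrip("\n").split("\t")
        prons.setdefault(word, []).append(pron.split())
    return prons

# One JSON object per output line
for word, pron_list in group_pronunciations(tsv_lines).items():
    print(json.dumps({"word": word, "pronunciations": pron_list}))
```

The first line printed is {"word": "OBTUSE", "pronunciations": [["AA0", "B", "T", "UW1", "S"]]}. Relying on dictionary insertion order is one way to keep words in input order; it is a design choice of this sketch, not a requirement of the assignment.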

5. Write a Python script called rhymes.py that takes a single argument, an English word in its normal spelling, and looks for all words that have a pronunciation that rhymes with it. This should work regardless of whether the input word is lowercase or uppercase. If the word is not found in cmudict, the program should write Query word not found to stderr. Start with this code template (but you are free to add your own functions/global variables). Be sure to add some doctests where indicated. Explain in a comment how you operationalize what counts as a rhyme.
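As an illustration of what doctests look like, the sketch below uses one possible operationalization of rhyme: two distinct pronunciations rhyme if they match from the last primary-stressed vowel to the end. Both this definition and the function names rhyme_part and rhymes are assumptions made for the example; they are not taken from the code template, and other definitions are equally defensible.

```python
import doctest

def rhyme_part(phones):
    """Return the phones from the last primary-stressed vowel to the end.

    >>> rhyme_part(["AA0", "B", "T", "UW1", "S"])
    ['UW1', 'S']
    >>> rhyme_part(["S", "T", "UW1"])
    ['UW1']
    """
    for i in range(len(phones) - 1, -1, -1):
        if phones[i].endswith("1"):
            return phones[i:]
    return phones  # no primary stress: fall back to the whole pronunciation

def rhymes(p1, p2):
    """Return True if two different pronunciations share a rhyme part.

    >>> rhymes(["G", "R", "UW1"], ["S", "T", "UW1"])
    True
    >>> rhymes(["G", "R", "UW1"], ["AA0", "B", "T", "UW1", "S"])
    False
    """
    return p1 != p2 and rhyme_part(p1) == rhyme_part(p2)

if __name__ == "__main__":
    doctest.testmod()  # silent when every doctest passes
```

Running the file with python -m doctest -v prints each example as it is checked.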

6. Add to your script a -p option that directs it to print the rhyming pronunciations instead of rhyming words. In this output, the ARPAbet symbols for VOWELS (AH, AE, etc.) should be replaced with their IPA equivalents. E.g., python rhymes.py -p "grew" should include S T u1 as one of its lines of output. Refer to the inside back cover of your textbook and https://en.wikipedia.org/wiki/Vowel_diagram for the Unicode IPA characters—but despite what the book says, substitute ɚ as the IPA symbol for ER. Leave the ARPAbet symbols for consonants.
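The vowel substitution can be sketched as follows. The table is deliberately partial: UW → u and ER → ɚ follow the text above, while the remaining entries are common conventions that should be checked against the textbook. VOWEL_TO_IPA and to_ipa_vowels are hypothetical names.

```python
import re

# Partial, illustrative table. UW and ER follow the assignment text;
# the other entries are assumptions to verify against the textbook.
VOWEL_TO_IPA = {"UW": "u", "ER": "ɚ", "IY": "i", "AE": "æ", "AA": "ɑ"}

def to_ipa_vowels(phones):
    """Replace mapped ARPAbet vowels with IPA, keeping the stress digit."""
    out = []
    for phone in phones:
        m = re.fullmatch(r"([A-Z]+)([012])", phone)
        if m and m.group(1) in VOWEL_TO_IPA:
            out.append(VOWEL_TO_IPA[m.group(1)] + m.group(2))
        else:
            out.append(phone)  # consonants and unmapped vowels stay as-is
    return " ".join(out)

print(to_ipa_vowels(["S", "T", "UW1"]))  # S T u1
```

A complete solution would need an entry for every ARPAbet vowel, so that none fall through to the unchanged branch.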

7. How might you go about guessing rhymes for OOV words like NOTHINGBURGER and CRONUT? (Write the answer in a file called a0.7.txt; you don’t have to implement this.)

Submission

Package all your .sh, .py, and .txt files together into a file called a0.zip and submit it via Canvas. (No need to include the .tsv and .json files: we will regenerate them from your code.) The deadline is Friday 9/16 at 4:59pm. If for some reason you are having trouble submitting with Canvas, email the instructor and TA with the zip file as an attachment.