Python Notes Accompanying ENLP Lecture 2

Nathan Schneider
7 September 2016; updated 18 January 2018

Some notes on Python 3, strings, regular expressions, doctests, and Jupyter.

For details, see sections 3.2–3.10 of Natural Language Processing with Python, Ch. 3: Processing Raw Text.

For Python 2 users: What's new in 3?

The most important things are:

Strings and Unicode work differently. See below.
print is now a function! Parentheses mandatory.
- Instead of the >> thing, there is the file= keyword argument.
The division operator / always returns a float. This is probably what you meant.
- Use // if you want integer division that throws out the remainder.
In 2.x, range(), map(), filter(), and zip() return a list. In 3.x, they return a lazy object for memory-efficient iteration, as enumerate() does in both versions.
- Same: dict.items(), dict.keys(), dict.values()
- This is because you probably only need one element at a time. Wrap with list() if you REALLY want a list.
- xrange() no longer exists in Python 3.
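
The differences listed above can be seen in a short Python 3 session. This is only an illustrative sketch: the variable names are made up for the example, and an in-memory io.StringIO buffer stands in for a real open file.

```python
import io

# print() is a function; file= replaces the Python 2 ">>" redirection.
buffer = io.StringIO()               # stand-in for an open file or sys.stderr
print('hello', 'world', file=buffer)
printed = buffer.getvalue()          # 'hello world\n'

# / is true division; // discards the remainder.
true_quotient = 7 / 2                # 3.5
floor_quotient = 7 // 2              # 3

# range(), zip(), and dict.keys() are lazy; wrap them in list() to get a list.
numbers = list(range(3))             # [0, 1, 2]
pairs = list(zip('ab', [1, 2]))      # [('a', 1), ('b', 2)]
keys = list({'x': 1, 'y': 2}.keys()) # ['x', 'y']
```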

There is a snazzy script called 2to3 that will help you convert your 2.x code into 3.x. From now on, when you must write 2.x code, it is good practice to put from __future__ import print_function, division at the top of the file (but this won't address all differences).

Strings

A string is a sequence of characters.
String literal in quotes (single or double): "Bob's favorite restaurant" or 'My "favorite" restaurant'
Backslash-escapes: 'Bob\'s favorite restaurant', "My \"favorite\" restaurant"
Whitespace: typically we encounter the space (' '), tab ('\t'), newline ('\n' on Unix/Mac, '\r\n' on Windows)
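Raw strings (prefixed with r): a backslash is actually a backslash. A backslash before the quote delimiter still keeps the quote from ending the string, but the backslash itself stays in the string.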
- Mainly useful for regular expressions: r'\w+\s' == '\\w+\\s'
Triple-quoted strings
- Can span multiple lines
- Useful for code documentation ("docstring")
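
A small sketch of the literal forms described above; the variable names are chosen only for illustration.

```python
# Single and double quotes produce the same kind of string.
same_text = 'Bob\'s favorite restaurant' == "Bob's favorite restaurant"

# In a raw string the backslash is kept literally.
patterns_equal = r'\w+\s' == '\\w+\\s'
tab_length = len('\t')         # 1: a single tab character
raw_tab_length = len(r'\t')    # 2: a backslash followed by the letter t

# A triple-quoted string can span lines; each line break becomes '\n'.
poem = """first line
second line"""
line_count = len(poem.splitlines())
```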

What can you do with strings?

You can concatenate strings and index, slice, and iterate over characters just like elements of a list
- 'My' + ' ' + 'racoon' == 'My racoon'
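Check substring containment like list containment: 'tri' in 'string' evaluates to True. (Don't write 'tri' in 'string' == True: Python chains the two comparisons and the result is False.)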
Count substring occurrences: .count()
Find substring position: .index() and .find()
- .index() raises an error if not found
- .find() returns -1 if not found
Test whether a string has only certain kinds of characters: .isupper(), .islower(), .isalpha(), .isalnum(), .isdigit(), .isspace()
- Note: .isupper(): True if at least 1 uppercase character and no lowercase characters; .islower() is similar
String transformations: .replace('old', 'new'), .upper(), .lower(), .split(), .splitlines(), delimiter.join(parts), .strip()
- Strings are immutable: an instance of a string cannot be modified. Methods like .replace() return a new string; the original is left unchanged.
- With no arguments, .split() splits on whitespace-sequences. Or you can specify a separator string, e.g., .split('\t') to split only on tabs.
- .join() is the reverse of .split(): it CREATES a single string out of a sequence (e.g., a list) of component strings. Usually called on a string literal. The sequence is given as the argument:
  - ' '.join(['My', 'racoon', 'has', 'hepatitis']) gives 'My racoon has hepatitis'
  - ''.join(['My', 'racoon', 'has', 'hepatitis']) gives 'Myracoonhashepatitis'
- .strip() removes all whitespace at the beginning and end of the string.
Convert to/from a number: str(12)=='12' and int('12')==12
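
The operations above, gathered into one runnable sketch. The example sentence is reused from the notes; the variable names are illustrative only.

```python
sentence = 'My racoon has hepatitis'

first_two = sentence[:2]                  # slicing works as with lists: 'My'
has_coon = 'coon' in sentence             # substring containment
a_count = sentence.count('a')             # number of occurrences of 'a'
found_at = sentence.find('racoon')        # position of the first match
missing = sentence.find('zebra')          # -1 because there is no match

index_raised = False
try:
    sentence.index('zebra')               # .index() raises instead of returning -1
except ValueError:
    index_raised = True

words = sentence.split()                  # split on runs of whitespace
rejoined = ' '.join(words)                # .join() reverses .split()
fields = 'a\tb c'.split('\t')             # split only on tabs
stripped = '  padded \n'.strip()          # trim whitespace at both ends

# Strings are immutable: these methods return new strings.
replaced = sentence.replace('racoon', 'cat')
shouted = sentence.upper()

digits_upper = '123'.isupper()            # False: no uppercase character at all
mixed_upper = 'ABC123'.isupper()          # True: uppercase present, no lowercase
```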

Strings in Python 3

Two kinds: str for text, bytes for encoded text
- In a str, each element is a character, whether ASCII (appears on a standard U.S. keyboard) or not.
  - The string 'é£' consists of 2 non-ASCII characters.
- bytes literals are prefixed with b. Each element is a byte.
  - A byte-string is said to be encoded because its interpretation as text-characters depends on an encoding.
- You should avoid working with bytes unless absolutely necessary
Converting between str and bytes requires an encoding
- Usually, we assume utf-8 (a Unicode encoding). Unicode assigns a number ("code point") to pretty much any character you can imagine; UTF-8 is a way of converting code points into bytes.
  - bytes('é£', 'utf-8')==b'\xc3\xa9\xc2\xa3'
  - str(b'\xc3\xa9\xc2\xa3', 'utf-8')=='é£'
- Mainly this comes up with files, which are stored as bytes.
- To read from or write to a file, the encoding determines how the bytes are read as characters: e.g. open('myfile.txt', 'r', encoding='utf-8') to open a file for reading with the UTF-8 encoding. The encoding must be given as a keyword argument, because the third positional argument of open() is the buffering policy.
  - UTF-8 is probably the default encoding for your Python 3 installation. To verify, open the interactive Python prompt and type: import locale; locale.getpreferredencoding(False)
  - 'r' is the default mode for opening a file, so if your system default encoding is UTF-8, open('myfile.txt', 'r', encoding='utf-8') can be simplified to open('myfile.txt').
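
A minimal sketch of the str/bytes round trip, both in memory and through a file. The file is written to a temporary directory so the example cleans up after itself; the file name is reused from the notes and the variable names are illustrative. Note that the encoding is passed with the encoding= keyword.

```python
import os
import tempfile

text = 'é£'
encoded = text.encode('utf-8')        # same result as bytes(text, 'utf-8')
decoded = encoded.decode('utf-8')     # same result as str(encoded, 'utf-8')
lengths = (len(text), len(encoded))   # 2 characters, 4 bytes

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'myfile.txt')
    # Text mode: Python encodes on write and decodes on read.
    with open(path, 'w', encoding='utf-8') as out_file:
        out_file.write(text)
    with open(path, 'r', encoding='utf-8') as in_file:
        read_back = in_file.read()
    # Binary mode ('rb') shows the bytes actually stored on disk.
    with open(path, 'rb') as raw_file:
        raw_bytes = raw_file.read()
```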

Regular Expressions

As noted above, it is recommended to use raw strings for regex pattern literals.

See the NLTK Book reading for the basics of regular expressions, including:

The operators ., ?, *, +, and repetition ranges like {4,6}
Disjunctions: a|b
Capturing groups: (.*)(..)
Character classes like [aeiou], [^A-Za-z], \d, \w, \s and their complements (\S etc.), plus the word-boundary anchor \b
Indicating the start and end of the string
How to use the re module: .compile(), .search(), .match(), .split(), .sub(), .finditer(). Checking whether a string matches or not, and if it does, getting the full match as well as matched groups.
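
As a quick reminder of how these pieces fit together, here is a sketch using the re functions named above. The phone-number pattern and the sample text are invented for the example.

```python
import re

text = 'Call 555-1234 or 555-9876 today'
phone = re.compile(r'(\d{3})-(\d{4})')   # raw string; two capturing groups

first = phone.search(text)               # first match anywhere in the string
full_match = first.group(0)              # the whole match
groups = first.groups()                  # the captured groups as a tuple

at_start = phone.match(text)             # None: .match() only tries position 0

all_numbers = [m.group(0) for m in phone.finditer(text)]
masked = phone.sub('XXX-XXXX', text)     # replace every match
tokens = re.split(r'\s+', 'a  b\tc')     # split on runs of whitespace

# ^ and $ anchor the pattern to the start and end of the string.
is_single_word = re.search(r'^\w+$', 'word') is not None
```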

Doctests

With the doctest module, you can write test cases inside function docstrings by writing >>> followed by an input and expected output:

def fun(x):
    """Check whether the last character in `x` is capitalized

    >>> fun('sdfsdf2094jsldkfX')
    True

    >>> fun('sdfsdf2094jsldk')
    False
    """
    return x[-1].isupper()

Then run:

import doctest
doctest.testmod()

to run the tests. (In general, the actual output must EXACTLY match the expected output, but see the module documentation for options allowing flexibility with whitespace etc.) If any of the tests fail, a report of each failure is printed; if they all pass, nothing is printed. Writing doctests is highly recommended because they double as human-readable documentation of your code's behavior.
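
The flexibility options mentioned above might be used roughly as follows. This is a sketch: squares() is a hypothetical function written for the example, and the two flags shown (NORMALIZE_WHITESPACE and ELLIPSIS) are only two of the options the doctest module provides.

```python
import doctest

def squares(n):
    """Return the squares of 0 through n - 1 (hypothetical example function).

    >>> squares(4)
    [0,    1,    4,    9]

    >>> squares(100)
    [0, 1, 4, ..., 9801]
    """
    return [i * i for i in range(n)]

# NORMALIZE_WHITESPACE treats any run of whitespace as a single space;
# ELLIPSIS lets "..." in the expected output match any text.
flags = doctest.NORMALIZE_WHITESPACE | doctest.ELLIPSIS
results = doctest.testmod(optionflags=flags)   # a (failed, attempted) result
```

Without the flags, both examples would fail, because the actual output differs from the expected text in spacing and in the elided middle of the list.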

Jupyter

Jupyter, formerly IPython Notebook, offers browser-based sessions with the interactive Python interpreter. Sessions can be saved for later. Each session is organized by cells, which generally contain an input (code) and output (the value of an expression, and anything that is printed). These code cells are run one at a time. You can also make cells with textual notes. IPython and Jupyter are included in the Anaconda distribution.