Python Notes Accompanying ENLP Lecture 2

Nathan Schneider
7 September 2016; updated 18 January 2018

Some notes on Python 3, strings, regular expressions, doctests, and Jupyter.

For details, see sections 3.2–3.10 of Natural Language Processing with Python, Ch. 3: Processing Raw Text.

For Python 2 users: What's new in 3?

The most important things are:

  1. Strings and Unicode work differently. See below.
  2. print is now a function! Parentheses mandatory.
    • Instead of the Python 2 print >>f syntax, there is the file= keyword argument.
  3. The division operator / always returns a float. This is probably what you meant.
    • Use // if you want floor division, which throws out the remainder (and rounds toward negative infinity for negative operands).
  4. In 2.x, range(), map(), filter(), and zip() return a list. In 3.x, they return a lazy object for memory-efficient iteration (as enumerate() already did in 2.x).
    • Same: dict.items(), dict.keys(), dict.values()
    • This is because you probably only need one element at a time. Wrap with list() if you REALLY want a list.
    • xrange() no longer exists in Python 3.
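
As a quick illustration of items 2–4, here is a minimal sketch that should run under Python 3. The StringIO buffer is only a stand-in for an open file, and all variable names are made up for the example.

```python
import io

# 2. print is a function; file= replaces the Python 2 "print >>f, ..." form.
buffer = io.StringIO()            # stand-in for an open file (illustrative)
print("hello", "world", file=buffer)
printed = buffer.getvalue()       # 'hello world\n'

# 3. / is true division; // is floor division.
true_quotient = 7 / 2             # 3.5
floor_quotient = 7 // 2           # 3
negative_floor = -7 // 2          # -4: // rounds toward negative infinity

# 4. range() and zip() are lazy; wrap them in list() to materialize a list.
lazy_range = range(3)
numbers = list(lazy_range)        # [0, 1, 2]
pairs = list(zip("ab", [1, 2]))   # [('a', 1), ('b', 2)]

# dict.keys() returns a view object rather than a list.
counts = {"the": 2, "cat": 1}
keys_view = counts.keys()
keys_list = list(keys_view)       # ['the', 'cat'] (insertion order, Python 3.7+)
```

Note that a range object can still be indexed and iterated repeatedly, so list() is only needed when you want list-specific behavior such as in-place modification.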

There is a snazzy script called 2to3 that will help you convert your 2.x code into 3.x. From now on, when you must write 2.x code, it is good practice to use from __future__ import print_function, division (but this won't address all differences).

Strings

What can you do with strings?

Strings in Python 3

Regular Expressions

As noted above, it is recommended to use raw strings for regex pattern literals.
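
To see why raw strings matter for patterns, here is a small sketch comparing a raw and an ordinary string literal containing \b. The sample text and variable names are invented for illustration.

```python
import re

text = "the cat sat on the mat"

# In a raw string, \b reaches the regex engine as backslash + b (word boundary).
raw_pattern = r"\bcat\b"
# In an ordinary string literal, \b is the backspace character (U+0008).
plain_pattern = "\bcat\b"

raw_match = re.search(raw_pattern, text)      # matches 'cat'
plain_match = re.search(plain_pattern, text)  # None: text contains no backspaces

raw_length = len(raw_pattern)      # 7 characters: \ b c a t \ b
plain_length = len(plain_pattern)  # 5 characters: backspace c a t backspace
```

The ordinary literal silently produces a different pattern rather than raising an error, which is what makes this mistake easy to miss.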

See the NLTK Book reading for the basics of regular expressions, including:

Doctests

With the doctest module, you can write test cases inside function docstrings by writing >>> followed by an input and expected output:

def fun(x):
    """Check whether the last character in `x` is capitalized
    >>> fun('sdfsdf2094jsldkfX')
    True
    >>> fun('sdfsdf2094jsldk')
    False
    """
    return x[-1].isupper()

Then call:

import doctest
doctest.testmod()

to run the tests. (In general, the actual output must EXACTLY match the expected output, but see the module documentation for options allowing flexibility with whitespace etc.) If any test fails, doctest prints a report showing the expected and actual output; if all tests pass, nothing is printed. Writing doctests is highly recommended because they double as human-readable documentation of your code's behavior.
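
As one example of the whitespace flexibility mentioned above, the sketch below uses the NORMALIZE_WHITESPACE directive so that the expected output can be wrapped across lines. The function tokens() is hypothetical, and running the tests through DocTestFinder and DocTestRunner (rather than testmod()) is just one way to check a single function and collect the pass/fail counts.

```python
import doctest


def tokens(s):
    """Split `s` on whitespace. (Hypothetical helper for illustration.)

    >>> tokens('to be')
    ['to', 'be']
    >>> tokens('a  b   c')  # doctest: +NORMALIZE_WHITESPACE
    ['a',
     'b',
     'c']
    """
    return s.split()


# Collect the doctests from this one function and run them.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(tokens, "tokens", globs={"tokens": tokens}):
    runner.run(test)

# summarize() returns a result whose .failed and .attempted fields are counts.
results = runner.summarize(verbose=False)
```

Without the directive, the second example would fail, because the actual output is printed on a single line.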

Jupyter

Jupyter, formerly IPython Notebook, offers browser-based sessions with the interactive Python interpreter. Sessions can be saved for later. Each session is organized by cells, which generally contain an input (code) and output (the value of an expression, and anything that is printed). These code cells are run one at a time. You can also make cells with textual notes. IPython and Jupyter are included in the Anaconda distribution.