Nathan Schneider
7 September 2016
Some notes on Python 3, strings, regular expressions, doctests, and Jupyter.
For details, see sections 3.2–3.10 of Natural Language Processing with Python, Ch. 3: Processing Raw Text.
The most important differences between Python 2.x and 3.x are:
print is now a function! Parentheses mandatory. Instead of the >> thing, there is the file= keyword argument.
/ always returns a float. This is probably what you meant. Use // if you want integer division that throws out the remainder.
In 2.x, range(), map(), filter(), and zip() return a list. In 3.x, they return a lazy object for memory-efficient iteration (enumerate() is lazy in both versions). Likewise, dict.items(), dict.keys(), and dict.values() return views rather than lists. Wrap the call in list() if you REALLY want a list.
xrange() no longer exists in Python 3.
There is a snazzy script called 2to3 that will help you convert your 2.x code into 3.x. From now on, when you must write 2.x code, it is always good practice to use from __future__ import print_function, division (but this won't address all differences).
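The differences listed above can be checked in a short Python 3 session. This is only a sketch: the variable names and the io.StringIO buffer (standing in for a real file or sys.stderr) are illustrative choices, not part of the original notes.

```python
import io

# print is a function; file= replaces the 2.x ">>" redirection.
buffer = io.StringIO()  # illustrative stand-in for a file or sys.stderr
print("hello", "world", file=buffer)
printed = buffer.getvalue()  # 'hello world\n'

# / is true division; // throws out the remainder.
true_quotient = 7 / 2    # 3.5
floor_quotient = 7 // 2  # 3

# zip() and range() are lazy; list() materializes the result.
lazy_pairs = zip(range(3), "abc")
pairs = list(lazy_pairs)  # [(0, 'a'), (1, 'b'), (2, 'c')]

# dict views also need list() to become real lists.
counts = {"a": 1, "b": 2}
keys = list(counts.keys())  # ['a', 'b']
```

Note that a lazy object such as lazy_pairs can be consumed only once; call list() on it if you need to iterate more than once.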
"Bob's favorite restaurant"
or 'My "favorite" restaurant' (string literals take either double or single quotes)
Escape a quote with a backslash: 'Bob\'s favorite restaurant', "My \"favorite\" restaurant"
Whitespace characters: space (' '), tab ('\t'), newline ('\n' on Unix/Mac, '\r\n' on Windows)
Raw strings leave backslashes alone: r'\w+\s' == '\\w+\\s'
Concatenation: 'My' + ' ' + 'racoon' == 'My racoon'
Substring test: ('tri' in 'string') == True
Counting and searching: .count(), .index() and .find()
.index() raises an error if not found
.find() returns -1 if not found
Character tests: .isupper(), .islower(), .isalpha(), .isalnum(), .isdigit(), .isspace()
.isupper(): True if at least 1 uppercase character and no lowercase characters; .islower() is similar
Transformations: .replace('old', 'new'), .upper(), .lower(), .split(), .splitlines(), delimiter.join(parts), .strip()
Methods such as .replace() return a new string; the original string is left unchanged.
.split() splits on whitespace-sequences. Or you can specify a separator string, e.g., .split('\t') to split only on tabs.
.join() is the reverse of .split(): it CREATES a single string out of a sequence (e.g., a list) of component strings. Usually called on a string literal. The sequence is given as the argument:
' '.join(['My', 'racoon', 'has', 'hepatitis']) gives 'My racoon has hepatitis'
''.join(['My', 'racoon', 'has', 'hepatitis']) gives 'Myracoonhashepatitis'
.strip() removes all whitespace at the beginning and end of the string.
Conversions: str(12)=='12' and int('12')==12
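As a rough illustration of the string methods above, the following sketch runs a few of them on one sample string. The sample sentence and the variable names are made up for this example.

```python
sentence = "  My racoon\thas hepatitis \n"

stripped = sentence.strip()        # leading/trailing whitespace removed
words = stripped.split()           # splits on any run of whitespace
tab_parts = stripped.split("\t")   # splits only on tabs
rejoined = " ".join(words)         # 'My racoon has hepatitis'

found = rejoined.find("racoon")    # 3 (index of the first match)
missing = rejoined.find("ferret")  # -1, no error
try:
    rejoined.index("ferret")       # same search, but raises when absent
    index_raised = False
except ValueError:
    index_raised = True

# .replace() and .upper() build new strings; rejoined is unchanged.
replaced = rejoined.replace("racoon", "ferret")
shouting = rejoined.upper()
a_count = rejoined.count("a")      # 3
```

Because strings are immutable, each call returns a fresh string, so calls can be chained, as in sentence.strip().upper().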
There are two string types: str for text, bytes for encoded text
In a str, each element is a character, whether ASCII (appears on a standard U.S. keyboard) or not.
bytes literals are prefixed with b. Each element is a byte.
Converting between str and bytes requires an encoding
A common choice is utf-8 (a Unicode encoding). Unicode assigns a number ("code point") to pretty much any character you can imagine; UTF-8 is a way of converting code points into bytes.
bytes('é£', 'utf-8')==b'\xc3\xa9\xc2\xa3'
str(b'\xc3\xa9\xc2\xa3', 'utf-8')=='é£'
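The two conversions above can also be written with the .encode() and .decode() methods. The sketch below uses the same two characters, spelled with \u escapes so the example does not depend on how this file itself is encoded; the variable names are illustrative.

```python
text = "\u00e9\u00a3"  # the two characters 'é' and '£'

# str -> bytes needs an encoding; equivalent to bytes(text, 'utf-8').
encoded = text.encode("utf-8")

# bytes -> str needs the same encoding; equivalent to str(encoded, 'utf-8').
decoded = encoded.decode("utf-8")

# A str is a sequence of characters, a bytes object a sequence of bytes.
char_count = len(text)     # 2
byte_count = len(encoded)  # 4: each of these characters takes 2 bytes in UTF-8

code_points = [ord(ch) for ch in text]  # Unicode code points: [233, 163]
byte_values = list(encoded)             # [195, 169, 194, 163]
```

The differing lengths show why code should not assume one byte per character.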
Pass the encoding as a keyword argument, as in open('myfile.txt', 'r', encoding='utf-8'), to open a file for reading.
As noted above, it is recommended to use raw strings for regex pattern literals.
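A minimal sketch of raw-string patterns used with a few functions from the re module follows. The patterns, sample strings, and variable names are invented for illustration.

```python
import re

# A raw string keeps backslashes literal, so the pattern reads as written.
raw_equals = r"\w+\s" == "\\w+\\s"  # True

# .compile() builds a reusable pattern; parentheses define groups.
pattern = re.compile(r"(\w+)\s(\d{4})")
match = pattern.search("Posted in September 2016 by Nathan")
full_match = match.group(0)  # 'September 2016'
groups = match.groups()      # ('September', '2016')

# .match() only succeeds at the start of the string, so this returns None.
at_start = pattern.match("Posted in September 2016 by Nathan")

# .sub() replaces matches; .split() splits on them.
normalized = re.sub(r"\s+", " ", "My  racoon\t\thas hepatitis")
fields = re.split(r"\s*[,;]\s*", "owl, racoon ;ferret")

# .finditer() yields one match object per non-overlapping match.
vowel_words = [m.group() for m in re.finditer(r"\b[aeiou]\w*", "an owl saw the eager racoon")]
```

Both .search() and .match() return None when nothing matches, so check the result before calling .group().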
See the NLTK Book reading for the basics of regular expressions, including:
Wildcard and repetition: ., *, +, and repetition ranges like {4,6}
Alternation: a|b
Grouping: (.*)(..)
Character classes and escapes: [aeiou], [^A-Za-z], \d, \w, \b, \s and their complements (\S etc.)
The re module: .compile(), .search(), .match(), .split(), .sub(), .finditer(). Checking whether a string matches or not, and if it does, getting the full match as well as matched groups.
With the doctest module, you can write test cases inside function docstrings by writing >>> followed by an input and expected output:
def fun(x):
    """Check whether the last character in `x` is capitalized

    >>> fun('sdfsdf2094jsldkfX')
    True
    >>> fun('sdfsdf2094jsldk')
    False
    """
    return x[-1].isupper()
Then call:
import doctest
doctest.testmod()
to run the tests. (In general, the actual output must EXACTLY match the expected output, but see the module documentation for options allowing flexibility with whitespace etc.) If any of the tests fail, a warning message is printed. This is highly recommended because it doubles as human-readable documentation of your code's behavior.
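For a closer look at the exact-match rule, here is a sketch that runs docstring examples one function at a time and collects the pass/fail counts instead of printing them. The helper run_docstring and the function spaced are hypothetical names introduced for this example. DocTestParser, DocTestRunner, and NORMALIZE_WHITESPACE come from the standard doctest module. In everyday use, doctest.testmod() is all you need.

```python
import doctest


def fun(x):
    """Check whether the last character in `x` is capitalized

    >>> fun('sdfsdf2094jsldkfX')
    True
    >>> fun('sdfsdf2094jsldk')
    False
    """
    return x[-1].isupper()


def spaced():
    """Expected output differs from the real output only in whitespace.

    >>> print('a   b')
    a b
    """


def run_docstring(func, optionflags=0):
    """Hypothetical helper: run func's docstring examples, return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(func.__doc__, {func.__name__: func},
                              func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False, optionflags=optionflags)
    results = runner.run(test, out=lambda text: None)  # discard the failure report
    return results.failed, results.attempted


fun_results = run_docstring(fun)          # both examples pass
strict_results = run_docstring(spaced)    # fails: output must match exactly
lenient_results = run_docstring(spaced, doctest.NORMALIZE_WHITESPACE)  # passes
```

The same option flag can be passed to doctest.testmod() through its optionflags argument.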
Jupyter, formerly IPython Notebook, offers browser-based sessions with the interactive Python interpreter. Sessions can be saved for later. Each session is organized by cells, which generally contain an input (code) and output (the value of an expression, and anything that is printed). These code cells are run one at a time. You can also make cells with textual notes. IPython and Jupyter are included in the Anaconda distribution.