Are you a Python hacker? This is a scattered list of patterns, tools, and conventions that I have found useful and thought were worth sharing. It is not a tutorial, though excellent ones exist.
Data structures
taxonomy of basic data structures
- Collections
  - Unordered
    - Unique elements: `set` (mutable), `frozenset` (immutable)
    - Repeated elements: `Counter` for a multiset/bag
  - Ordered
    - Unique elements: OrderedSet
    - Repeated elements: `list` (mutable), `tuple` (immutable), and `deque` (like a list, but with efficient access to the beginning and end, such as would be desirable for a queue or stack)
- Mappings, i.e. key-value pairs (many-to-one: each key maps to exactly one value; keys are hashable objects)
  - General-purpose: `dict`
    - unordered keys
    - fully mutable: keys/values can be added, modified, and removed
    - lookup on a new key is an error
  - Ordered keys: `OrderedDict` (by original insertion order)
  - Predefined inventory of ordered keys: `namedtuple` (creates a class that instantiates tuple-like objects with named fields)
  - Non-overrideable values: FixedDict (values for existing keys cannot be reassigned; see below)
  - Mapping ranges to values: BetweenDict (items are stored by nonoverlapping numeric ranges, and lookup is on a key falling within some range; see below)
  - Default values: `defaultdict`, `Counter` (lookup on a new key returns a default value; in the latter case, 0)
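As a quick illustration of the default-value behavior (`Counter` and `defaultdict` are in the `collections` module):

```python
from collections import Counter, defaultdict

c = Counter('abracadabra')
c['a']   # 5
c['z']   # 0: lookup on a new key returns the default count

d = defaultdict(list)
d['evens'].append(2)  # a fresh list is created on first lookup
d == {'evens': [2]}   # True
```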
`FixedDict`: preventing reassignment to keys
Sometimes it is desirable to create a mapping which enforces the requirement that values cannot be reassigned. This is easily done by subclassing the `dict` class:
```python
class FixedDict(dict):
    '''Dict subclass which prevents reassignment to existing keys.'''
    def __setitem__(self, key, newvalue):
        if key in self:
            raise KeyError('FixedDict cannot reassign to key {0!r} (current: {1!r}, new value: {2!r})'.format(key, self[key], newvalue))
        dict.__setitem__(self, key, newvalue)
```
Thus, the last line of the following will trigger an error:
```python
d = FixedDict()
d['a'] = 1
d['b'] = 2
d['a'] = 3  # KeyError
```
We can make our class slightly fancier to allow several options for handling reassignment attempts: with the version below, `FixedDict(reassign='exception')` will create an object behaving as above; `FixedDict(reassign='ignore')` will cause reassignment attempts to fail silently; `FixedDict(reassign='replace')` will let the new value replace the old one; and `FixedDict(reassign='different')` will cause an error to be thrown only if the new value and the old value are not equal to each other. A function can even be passed to customize the reassignment behavior.
```python
class FixedDict(dict):
    '''Dict subclass which constrains reassignment to existing keys.'''

    def __init__(self, *args, **kwargs):
        '''
        Optional argument 'reassign' to specify what to do if __setitem__()
        is invoked for a key that is already present:
         * reassign='exception' indicates that an error should be thrown (default)
         * reassign='ignore' indicates that the new value should be ignored without error
         * reassign='replace' indicates that the new value should replace the old one without error
         * reassign='different' indicates that an error should be thrown if the new value is
           different from the old one; if they are the same, the new value will be ignored silently
         * reassign=<2-arg function> indicates that the key will be reassigned to the value
           returned by the function when applied to the old and new values
        The reassignment policy can also be specified as an argument to __setitem__(),
        overriding any instance-level policy.
        '''
        reass = kwargs.get('reassign', 'exception')
        assert reass in ('exception','ignore','replace','different') or callable(reass), \
            "Invalid 'reassign' parameter: {0}".format(reass)
        self._reassign = reass
        if 'reassign' in kwargs:
            del kwargs['reassign']
        dict.__init__(self, *args, **kwargs)

    def __setitem__(self, key, value, reassign=None):
        if key in self:
            return self.reassign(key, value, policy=reassign)
        dict.__setitem__(self, key, value)

    def reassign(self, key, newvalue, policy=None):
        p = policy or self._reassign
        if p=='exception' or (p=='different' and newvalue!=self[key]):
            raise KeyError('FixedDict cannot reassign to key {0!r} (current: {1!r}, new value: {2!r})'.format(key, self[key], newvalue))
        elif p=='ignore':
            pass
        elif p=='replace':
            dict.__setitem__(self, key, newvalue)
        elif callable(p):
            dict.__setitem__(self, key, p(self[key], newvalue))
```
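A quick sketch of how the policies behave, using the class above:

```python
d = FixedDict({'a': 1}, reassign='different')
d['b'] = 2   # new key: fine
d['a'] = 1   # same value: silently ignored
# d['a'] = 3 would raise KeyError under the 'different' policy
d.__setitem__('a', 3, reassign='replace')  # per-call override of the policy
d['a'] == 3  # True
```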
`BetweenDict`: mapping ranges to values
Sometimes information is organized by ranges of numeric values; we want to store the information compactly even though lookup will be on a single value at a time. The following data structure accomplishes this, assuming ranges do not overlap:
```python
class BetweenDict(dict):
    def __init__(self, d={}):
        for k, v in d.items():
            self[k] = v

    def __getitem__(self, key):
        for k, v in self.items():
            if k[0] <= key < k[1]:
                return v
        raise KeyError("Key '%s' is not between any values in the BetweenDict" % key)

    def __setitem__(self, key, value):
        try:
            if len(key) == 2:
                if key[0] < key[1]:
                    dict.__setitem__(self, (key[0], key[1]), value)
                else:
                    raise RuntimeError('First element of a BetweenDict key '
                                       'must be strictly less than the '
                                       'second element')
            else:
                raise ValueError('Key of a BetweenDict must be an iterable '
                                 'with length two')
        except TypeError:
            raise TypeError('Key of a BetweenDict must be an iterable '
                            'with length two')

    def __contains__(self, key):
        try:
            return bool(self[key]) or True
        except KeyError:
            return False
```
source: Joshua Kugler's blog
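Example usage (with hypothetical grade boundaries):

```python
grades = BetweenDict({(90, 101): 'A', (80, 90): 'B', (70, 80): 'C', (0, 70): 'F'})
grades[85] == 'B'  # True: 80 <= 85 < 90
72 in grades       # True
grades[-5]         # KeyError: not within any stored range
```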
tuple/dict/set literals
`{}` is an empty dict, not a set. For an empty set, use `set()`. `()` is an empty tuple. `(x)` is just the expression `x`; `(x,)` is the tuple containing a single element, `x`.
list/dict/set comprehensions and generator expressions
List comprehensions are syntactic sugar for creating a list via a loop:
```python
evens = [i*2 for i in range(6)]                # [0, 2, 4, 6, 8, 10]
charcodes = [chr(ord(c)+1) for c in 'ABCabc']  # ['B', 'C', 'D', 'b', 'c', 'd']
fours = [j for j in evens if j%4==0]           # [0, 4, 8]
```
Recent versions of Python also support dict and set comprehensions:
```python
d = {'one': 1, 'three': 3, 'two': 2}
inverted = {v: k for k,v in d.items()}                 # {1: 'one', 3: 'three', 2: 'two'}
halved = {k.upper(): v/2 for k,v in d.items() if v%2==0}  # {'TWO': 1}
uniqchars = {c.lower() for c in 'OneTwoThree'}         # {'e', 'h', 'n', 'o', 'r', 't', 'w'}
```
Generator expressions are like comprehensions, but create a generator rather than a new data structure in memory. This is useful for calling functions that accept iterable arguments, and gives rise to an elegant expression of the count-if pattern:
```python
sum(1 for x in items if condition(x))
```
Cf. sections on generators, iteration patterns, and iteration/sequence operations.
» Jonathan Elsas points out that, since generator expressions were introduced in Python 2.4, the following alternatives to dict and set comprehensions work even for Python <2.7:
```python
inverted = dict((v, k) for k,v in d.items())
uniqchars = set(c.lower() for c in 'OneTwoThree')
```
sequence literal repetition operator
Shorthand for concatenating something with itself a given number of times:
```python
[None]*3 == [None, None, None]              # True
(1, 2, 3)*2 == (1, 2, 3, 1, 2, 3)           # True
'c'+'o'*10+'l!'*4 == 'cooooooooool!l!l!l!'  # True
```
You probably want to avoid repeating mutable items, because there will be multiple references to the same instance:
```python
x = [[]]*5
x[0].append(1)
x == [[1], [], [], [], []]      # False!
x == [[1], [1], [1], [1], [1]]  # True
```
(Suggested by Jonathan Elsas.)
built-in iteration/sequence operations
selecting a single element
- `max(a, b[, c[, ...]][, key=func])` or `max(iter[, key=func])`: returns the maximum of several items. A 1-argument function to return a comparison key (as in `list.sort()` and `sorted()`) can be supplied via `key=func`.
- `min(...)`: analogous to `max(...)`.
checking elements
- `all(iterable)`: returns whether all elements of the argument are true
- `any(iterable)`: returns whether any element of the argument is true
aggregating elements
- `reduce(function, iter[, initializer])`
- `sum(iter[, start])`
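For example (`reduce` is a builtin in Python 2 but moves to `functools.reduce` in Python 3):

```python
sum([1, 2, 3])       # 6
sum([1, 2, 3], 10)   # 16: the optional start value is added in
reduce(lambda x, y: x*y, [1, 2, 3, 4])      # 24
reduce(lambda x, y: x*y, [1, 2, 3, 4], 10)  # 240: the initializer comes first
```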
iterator protocol
- `iter(obj[, sentinel])`: returns an iterator via `obj.__iter__()`; if `sentinel` is provided, `obj` is instead called with no arguments for each step of iteration, and iteration terminates once it returns the value of `sentinel`
- `next(iter[, default])`: equivalent to `iter.next()`, but returns `default` if the end of iteration has been reached
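The sentinel form turns a repeated call into iteration. A common idiom (a sketch; `process()` is a hypothetical handler) reads a file in fixed-size chunks:

```python
with open('data.bin', 'rb') as f:
    # call f.read(1024) repeatedly until it returns '' (end of file)
    for chunk in iter(lambda: f.read(1024), ''):
        process(chunk)
```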
producing a new sequence or modifying the iteration
- `enumerate(sequence[, start=0])`: yields (index, element) pairs
- `zip([iter[, ...]])`
- `reversed(sequence)`: constructs a reverse iterator
- `sorted(iter[, cmp[, key[, reverse]]])`: returns a sorted list
- `filter([function,] iterable)`: returns a list containing the elements e of iterable for which `function(e)` is true (or which are true themselves, if `function` is omitted)
- `map(function, iterable, ...)`
- `range([start], stop[, step])`: creates a list with the specified integer progression
- `xrange([start], stop[, step])`: behaves like `range()` with respect to iteration but doesn't store all the values simultaneously
Cf. itertools module.
the inverse of `zip(...)` is `zip(*...)`
```python
x = zip([1,2,3], 'abc', 'ABC')
x == [(1,'a','A'), (2,'b','B'), (3,'c','C')]  # True
y = zip(*x)
y == [(1,2,3), ('a','b','c'), ('A','B','C')]  # True
```
functional operations between sets
In addition to methods like `.union()`, `.intersection()`, and `.difference()`, there are operator versions:
```python
hi = set('hello')
bye = set('goodbye')
# all True:
hi & bye == {'e', 'o'}
hi | bye == {'b', 'd', 'e', 'g', 'h', 'l', 'o', 'y'}
hi ^ bye == {'b', 'd', 'g', 'h', 'l', 'y'}
hi - bye == {'h', 'l'}
```
breaking a sequence into parts of length n
```python
parts = [seq[i:i+n] for i in xrange(0, len(seq), n)]
```
sequence comparison
Sequences are compared element-wise. More precisely, pseudocode for `seq1 < seq2` is as follows:

```python
for i in range(len(seq1)):
    if i >= len(seq2):
        return False
    elt1 = seq1[i]
    elt2 = seq2[i]
    if elt1 == elt2:
        continue
    elif elt1 < elt2:
        return True
    else:  # ASSUMES elt2<elt1!
        return False
if len(seq2) > len(seq1):
    return True
return False
```
In other words, when comparing or sorting sequences, it is assumed that exactly one of the following is true for a given pair of elements: `a<b`, `a==b`, `b<a`.
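A few consequences of these semantics:

```python
[1, 2] < [1, 2, 3]   # True: a proper prefix is smaller
(1, 'a') < (1, 'b')  # True: the first elements tie, so the second decides
'abc' < 'abd'        # True: strings compare character by character
[0, 99] < [1]        # True: the first element decides; later elements are ignored
```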
`namedtuple` creation decorator

A shortcut for defining `namedtuple` types with a more Pythonic syntax:
```python
from collections import namedtuple
import inspect

def make_namedtuple(fn):
    args = ' '.join(inspect.getargspec(fn).args)
    return namedtuple(fn.__name__, args)
```
Usage example:
```python
@make_namedtuple
def Point(x, y): pass

p = Point(1, 2)  # Point(x=1, y=2)
```
non-mutating version of dict.update()
Extending `dict` with a non-mutating update operation:
```python
class Xdict(dict):
    def __lshift__(x, y):
        return Xdict(x.items() + y.items())

d = Xdict({'a': 9, 'b': 12})
d << {'a': 3, 'c': 8}  # result: {'a': 3, 'b': 12, 'c': 8}
d['a'] == 9            # True: d itself is unchanged
```
(Original suggestion: [1])
Overloading `<<` with these semantics is a new idiom, but I think the symbol `<<` plausibly denotes a sort of modified concatenation/asymmetric union operation. Reasons to prefer this over other options:

- The thread proposes a non-mutating method called `replace()` to perform the same functionality. The disadvantage there, I think, is that it will be hard to remember whether `replace()` is mutating (an alias for `update()`) or not. Infix operators, on the other hand, are by convention non-mutating.
- `__or__` (`|`) is classically a commutative operation, and is already overloaded by `set` and `Counter`.
- `__add__` (`+`) is quite common, and thus banning the addition of two `dict`s probably catches a lot of bugs. Moreover, when `dict` values are numeric, a user might expect that `dict1 + dict2` would add corresponding keys' values (as it would if they were `Counter`s).
- `Counter` does not already overload `<<`, and thus could be extended to support the same behavior.
Variables & Types
built-in types in Python 2.7
These come with corresponding conversion functions. See also Built-in Types.
basestring, bool, complex, dict, file, float, frozenset, int, list, long, memoryview, object, set, slice, str, tuple, type, unicode
checking for built-in types
Preferred methods for checking if a variable is a number, a string, an iterable, etc.:
```python
from numbers import Number
from collections import Iterable

def describeType(x):
    s = str(type(x))
    if x is None:
        s += ' : None!'
    elif x is True or x is False:
        s += ' : a boolean!'
    elif isinstance(x, Number):
        s += ' : a number!'
    elif isinstance(x, basestring):  # Python 2.x; simply use str in 3.x (all strings are Unicode)
        s += ' : a (possibly Unicode) string!'
    if callable(x):
        s += ' : a callable, such as a function!'
    if hasattr(x, '__iter__'):
        s += ' : a non-string iterable, such as a collection!'
    if isinstance(x, Iterable):
        s += ' : any iterable, including strings and collections!'
    return s
```
To check whether a variable's type is any of several possibilities, pass a tuple of types as the second argument to `isinstance()`:

```python
if isinstance(x, (Number, basestring)):
    pass  # x is a number or a string
```
Unicode and Python strings
Tutorial: "Unicode in Python, Completely Demystified" (Kumar McMillan, PyCon 2008)
To summarize: in Python 2.7,
- There are two types of strings: plain strings (type `str`) and Unicode strings (type `unicode`). Plain strings consist of bytes, whereas Unicode strings allow for multi-byte characters (represented with `\u####` escapes).
- Unless `from __future__ import unicode_literals` is specified, the `u` prefix is necessary to specify a Unicode string literal.
- `str.decode('utf-8')` deciphers a UTF-8-encoded bytestring into a Unicode string; `unicode.encode('utf-8')` does the reverse.
- Use the `codecs` module to open a file for reading directly into Unicode strings.
- To check whether a variable is a string (of either type), use `isinstance(x, basestring)`.
- If you attempt to concatenate a plain string with a Unicode string, Python will implicitly attempt to decode the plain string as ASCII. This will fail if it contains non-ASCII bytes.
```python
s = '\xe2\x96\xaa'  # UTF-8 bytes
u = u'\u25aa'
# s or u will print as: ▪
u == s.decode('utf-8')  # True
s == u.encode('utf-8')  # True
len(s) == 3  # True
len(u) == 1  # True
s == u       # False; UnicodeWarning (comparing str with unicode)
v = '\u25aa'
v == '\\u25aa'  # True (i.e. \u escape is meaningless for plain strings)
s.encode('utf-8')  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
isinstance(u, str)         # False
isinstance(u, unicode)     # True
isinstance(u, basestring)  # True
isinstance(s, basestring)  # True
s + u  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
```
Python 3 makes Unicode strings the default for working with string literals/text files, and provides a `bytes` type for plain bytestrings.
built-in string functions
- `format(value[, format_spec])`
- `repr(obj)`
- `intern(str)`: returns the interned version of the string
character codes
- `chr(ascii_code)`
- `unichr(unicode_codepoint)`
- `ord(char)`: inverse of `chr()`/`unichr()`
string, regular expression references
- sequence types
- re module
- Regexp Syntax Summary (also Grep, Emacs, Perl, Tcl)
- string & regexp API cheatsheet (also Java, Javascript, PHP, Ruby)
- format string reference
variable scoping
See "Understanding UnboundLocalError in Python".
Essentially: with respect to assignment, variables not declared as `global` (or, in Python 3.x, `nonlocal`) are local to the scope in which they are assigned. Variables local to an enclosing scope can be accessed:
```python
x = 0
l = []
def f0():
    l.append(x)
f0()
l == [0]  # True
```
However, they cannot be assigned to—the following results in a different, local variable which shadows the first one:
```python
x = 0
def f1():
    x = 1  # a different 'x'
f1()
x == 0  # True
```
Moreover, a variable name in a given scope cannot refer variously to an enclosing-scope variable and to a local variable shadowing it:
```python
x = 0
def f2():
    print(x)  # refers to first 'x'
    x = 2     # attempt to create a second 'x'
def f3():
    x = x + 1  # attempt to refer to the first 'x' (RHS), then create a second 'x'
def f4():
    x += 1     # equivalent to x = x + 1
def f5():
    x = 5      # global, because 'global x' occurs in this scope
    global x

f2()  # UnboundLocalError
f3()  # UnboundLocalError
f4()  # UnboundLocalError
f5()
x == 5  # True
```
naming conventions
I like to use the following conventions for variable names where I think it will help avoid confusion:
| Suffix | Meaning | Examples |
|---|---|---|
| S | string (esp. in contrast to a numeric value or list) | `dataS` |
| T | template string or regex pattern | |
| L | list | |
| C | Counter | |
| F | file | `inF`, `outF` |
| FP | file path (as a string) | `inFP`, `outFP` |
| R | file reader (e.g. for CSV files) | `inR` |
| W | file writer (e.g. for CSV files) | `outW` |
| X | function (e.g., in an argument list) | |
(Suffixes I haven't settled on yet, but might: `D` for dict or directory or distribution, `P` for probability or prior, `G` for graph, `N` for node, `I` for integer, `U` for tuple, `M` for markup or map, `V` for vector/matrix, `KV` for key-value pairs (such as are returned by `dict.items()`), `Q` for query string, etc.)
(Possible convention: double the suffix to indicate a variable may be a single instance or a collection, such as in a parameter list. For instance, `inFF` would be "input file or files".)
I often prefix variable names with `n` if they are counts: e.g., `nItems` for "number of items". An infixed `2` is short for "to" in mappings (dicts or conversion functions) like `name2id`.
I also typically reserve the variable names `s` for a string, `f` for a file, `d` for a dict, `m` for a regular expression match, and `ex` for an exception. Doubling a single-character variable name indicates a collection: `mm` is a list of matches, for example. For variable names that are words, pluralize them to indicate a collection.
Expressions & Statements
iteration patterns `for` everyone
```python
# seq is a sequence (like a list, tuple, or string)
for x in seq: pass
for i,x in enumerate(seq): pass  # i is the element index
for i in range(len(seq)): pass
for x in sorted(seq): pass       # iterates through contents in sort order
for x,y in zip(seq1,seq2): pass  # iterates until the end of the shortest sequence is reached

# d is a mapping type (like a dict)
for k in d: pass                 # k is the entry key
for v in d.values(): pass
for k,v in d.items(): pass
for i,(k,v) in enumerate(d.items()): pass
# note that the ordering of items may be arbitrary (depending on the mapping type)
for k,v in sorted(d.items(), key=lambda kv: kv[0]): pass  # sort order of keys
for k,v in sorted(d.items(), key=lambda kv: kv[1]): pass  # sort order of values
```
See also: comprehensions and generator expressions, generators, iteration/sequence operations
debugging with assert
Python's `assert` statement is great for debugging. In the basic case, it lets you specify assumptions about the state of your program that are not otherwise checked but could lead to subtle bugs (say, if user code is unaware of a function's expectations, or if you forget your own assumptions down the road). For example:

```python
assert x >= 1
assert len(items1) == len(items2), 'Length mismatch: {} and {}'.format(len(items1), len(items2))
```
The optional expression following the comma is a value to be printed along with the `AssertionError` that is raised if your condition fails. Callers can intercept `AssertionError`s, but otherwise they will terminate the execution of your code prematurely like any other exception. This is another way in which they are handy for quick debugging: for example, if I want to add a bunch of print statements and then have the program terminate, I use

```python
assert False
```

or if I simply want to print a couple of variables,

```python
assert False, (x, y)
```

This is less clunky than adding return or exit statements, and is readily identifiable as debugging code because `assert False` would not normally occur as part of a program.
Note that `assert False,x` is equivalent to `assert False` if `x` is `None` or `()`: no message is added to the `AssertionError`, which can be misleading. A solution is to enclose the mysterious expression in a tuple: `assert False,(x,)` will always display the value of `x`. To print an empty or complicated string for debugging purposes, convert it to its literal form with `repr(s)`.
boolean expression values
When it comes to truth values in Python, `False`, `0`, `None`, `''`, `()`, and `[]` are all false. In fact, any instance of a class that defines a `__len__()` method evaluates as false if its length is 0.

The value of a boolean expression is not always of type `bool`; rather, it is the value of the last operand that was evaluated. All of the following are true:
```python
(None or 'a') == 'a'
('a' or None) == 'a'  # short-circuited: None is never evaluated because 'a' is a true value
('a' and None) is None
('a' and False) is False
```
This property gives us a shorthand for expressions of the form a-if-a-is-true-else-b:
```python
x = y if y else z
# can be rewritten as:
x = y or z
```
fancy uses of with

Nested `with` blocks can be exploited to represent structured content, such as XML (discussion). Cf. PEP 359, which proposes new syntax for this sort of functionality.
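One way this can look, as a minimal sketch using `contextlib.contextmanager` (the `tag()` helper here is hypothetical):

```python
from contextlib import contextmanager
import sys

@contextmanager
def tag(name, out=sys.stdout):
    out.write('<{0}>\n'.format(name))   # opening tag on entering the block
    yield
    out.write('</{0}>\n'.format(name))  # closing tag on leaving the block

with tag('html'):
    with tag('body'):
        sys.stdout.write('hello\n')
# prints <html>, <body>, hello, </body>, </html>, each on its own line
```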
built-in math functions
- `abs(num)`
- `divmod(a, b)`: returns (essentially) `(a//b, a%b)`
- `pow(x, y[, z])`
- `round(x[, n])`
- `max()`, `min()`, `sum()`: see iteration/sequence operations
base conversion
- `bin(int)`: convert to a binary string
- `oct(int)`: convert to an octal string
- `hex(int)`: convert to a hexadecimal string
Cf. math module.
Functions
generators are awesome
Generators are functions that do not return but yield values; they are designed to be iterated over lazily. Calling the function returns a generator instance, which maintains the internal state of one use of the function. Iterating over that instance (in a `for` loop or by explicitly calling `next()`) will resume evaluation of the function until the next `yield` statement is reached, returning the yielded value. Once the generator instance has been exhausted, it raises a `StopIteration` exception, per the iterator protocol.
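A minimal illustration of the protocol, with a hypothetical countdown generator:

```python
def countdown(n):
    while n > 0:
        yield n
        n -= 1

g = countdown(3)
next(g)   # 3
next(g)   # 2: execution resumed just after the yield
list(g)   # [1]: consumes the rest
# next(g) would now raise StopIteration
```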
Notably, lazy evaluation with generators can make code more efficient by avoiding the creation of intermediate data structures. They are especially powerful when chained together. See this tutorial for an in-depth discussion. An example from slide I-39:
```python
wwwlog = open("access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
```
Rather than creating and operating over lists, we are operating directly over generators (here, generator expressions) arranged in a pipeline. For very large files, there is a major performance advantage because items are "pulled" through the pipeline one by one for processing. They do not need to be kept around after being tallied by the loop within the `sum()` function.
def foo(*args, bar='baz') is illegal, though intuitive

…because named arguments can also be provided positionally (without the keyword), which means a call `foo(x)` would be ambiguous between `foo(args=(x,))` and `foo(args=(), bar=x)`. (The same holds true if there are named arguments before `*args`.) The workaround is to use `**kwargs`:
```python
def foo(*args, **kwargs):
    bar = 'baz'  # default
    if 'bar' in kwargs:
        bar = kwargs['bar']
        del kwargs['bar']
    assert len(kwargs)==0  # check that illegal arguments haven't been provided
    # main body
```
We can write a decorator to reduce the amount of code required for this case (as far as I know there is no equivalent in the standard library):
```python
from functools import wraps

def xkwargs(**defaults):
    def wrap(fxn):
        @wraps(fxn)
        def _(*args, **kwargs):
            for k in kwargs:
                assert k in defaults, 'Invalid keyword argument: {}'.format(k)
            d = dict(defaults)
            d.update(kwargs)
            return fxn(*args, **d)
        return _
    return wrap

@xkwargs(bar='baz')
def foo(*args, **kwargs):
    bar = kwargs['bar']
    # main body
```
If we want to allow extra keyword arguments besides the ones with defaults, `functools.partial` almost fits the bill. We can instead define the decorator as follows:

```python
from functools import partial, update_wrapper

def xkwargs(**defaults):
    def wrap(fxn):
        return update_wrapper(partial(fxn, **defaults), fxn)
    return wrap
```
`foo` will then be a `partial` object which delegates to the original function, filling in with defaults as necessary. The defaults will live in `foo.keywords`.
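A quick sketch of the resulting behavior, assuming the definitions above:

```python
@xkwargs(bar='baz')
def foo(*args, **kwargs):
    return kwargs['bar']

foo(1, 2)             # 'baz': the default is filled in
foo(1, 2, bar='qux')  # 'qux': an explicit keyword overrides the default
foo.keywords          # {'bar': 'baz'}
```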
keyword arguments are great for rapid prototyping
Often it is impossible to predict the twists and turns of how functionality will evolve as a piece of code is being developed. Even libraries with general-purpose utility functions will evolve as user code wants finer-grained options. Keyword arguments allow great flexibility in this context, because it's almost always possible to add (optional) keyword arguments with defaults that preserve previous behavior but add new functionality. Adding parameters to a function is as easy as adding attributes or methods to a class.
For example, I have a function for drawing F score contours on a precision-vs.-recall plot with matplotlib. It started off like this:
```python
def fcurves():
    from pylab import ogrid, divide, clabel, contour, plot
    X, Y = ogrid[0:1:.001, 0:1:.001]  # range of R and P values, respectively; X is a column vector, Y is a row vector
    F = divide(2*X*Y, X+Y)            # matrix s.t. F[P,R] = 2PR/(P+R)
    plot(X[...,0], X[...,0], color='#cccccc')  # P=R
    # show F score curves at values .5, .7, and .9
    clabel(contour(X[...,0], Y[0,...], F, levels=[.5,.7,.9], colors='#aaaaaa', linewidths=2),
           fmt='F=%.1f', inline_spacing=1)
```
This was sufficient for what I needed at the time. But later I needed to make a similar plot, and found this function, but wanted slightly different functionality. Rather than replace it, I opted to generalize it via keyword arguments:
```python
def fcurves(levels=[.5,.7,.9], lblfmt='F=%.1f'):
    from pylab import ogrid, divide, clabel, contour, plot
    X, Y = ogrid[0:1:.001, 0:1:.001]  # range of R and P values, respectively; X is a column vector, Y is a row vector
    F = divide(2*X*Y, X+Y)            # matrix s.t. F[P,R] = 2PR/(P+R)
    plot(X[...,0], X[...,0], color='#cccccc')  # P=R
    # show F score curves at the specified levels
    clabel(contour(X[...,0], Y[0,...], F, levels=levels, colors='#aaaaaa', linewidths=2),
           fmt=lblfmt, inline_spacing=1)
```
Calling the function without arguments produces the same result as before! To the extent that this customization power (a) does not add too much additional complexity and (b) will be reused in the future, it is an improvement over the original implementation of the function, without breaking backward compatibility.
Warning: When adding new keyword arguments to a function, be sure to check that they are passed appropriately in any recursive calls!
Classes & Objects
built-in object operations
- `callable(obj)`: returns whether the argument is callable
- `cmp(x,y)` (comparison): strictly positive if `x>y`, negative if `x<y`, zero if they are equal
- `hash(obj)` (hashing)
- `id(obj)` (identifier): returns an integer id which is unique for the object instance during its lifetime
- `isinstance(obj, classinfo)`
- `issubclass(cls, classinfo)`
- `len(obj)` (length)
- `super(type[, obj_or_type])`
- `type(object)`: returns the type of an object
- `type(name, bases, dict)`: constructs a type (class) with the specified name, superclasses, and contents (see the sketch below)
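For instance, the three-argument form of `type()` can build a class on the fly (a toy example):

```python
# equivalent to:  class Point(object): dims = 2
Point = type('Point', (object,), {'dims': 2})
p = Point()
p.dims                     # 2
isinstance(p, Point)       # True
issubclass(Point, object)  # True
```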
functions pertaining to attributes (members)
- `hasattr(obj, name)`
- `getattr(obj, name[, default])`: equivalent to `obj.<name>`, with an optional default value in case the attribute does not exist
- `setattr(obj, name, value)`
- `delattr(obj, name)`: equivalent to `del obj.<name>`
- `property([fget[, fset[, fdel[, doc]]]])`: class property attribute with the given getter, setter, and deleter
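For example, with a toy class:

```python
class Thing(object): pass

t = Thing()
setattr(t, 'color', 'red')  # same as: t.color = 'red'
getattr(t, 'color')         # 'red'
getattr(t, 'size', None)    # None: the default is returned instead of raising AttributeError
hasattr(t, 'size')          # False
delattr(t, 'color')         # same as: del t.color
```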
decorators for methods/properties
- `@staticmethod`: decorates a member function that does not take an instance argument
- `@classmethod`: decorates a member function that is like a static method, but takes the type of the class as its first parameter
- `@property`: decorates a function whose name is the name of a property and which defines the getter for that property
- `@x.setter`: decorates a function whose name is property `x` and which defines a setter for that property
- `@x.deleter`: decorates a function whose name is property `x` and which defines a deleter for that property
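A minimal sketch showing `@property` and `@x.setter` together (the `Circle` class here is hypothetical):

```python
class Circle(object):
    def __init__(self, radius):
        self._radius = radius

    @property
    def radius(self):
        return self._radius

    @radius.setter
    def radius(self, value):
        assert value >= 0, 'radius must be nonnegative'
        self._radius = value

c = Circle(2.0)
c.radius        # 2.0: attribute syntax invokes the getter
c.radius = 3.0  # invokes the setter
```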
decorator to store constructor arguments as attributes
Implementation and examples here. For instance,
```python
class Person(object):
    def __init__(self, first, last, email, dob, role='student'):
        self.name = first + ' ' + last
        self.email = email
        self.dob = dob
        self.role = role
```
can be simplified to
```python
class Person(object):
    @autoassign('email', 'dob', 'role')
    def __init__(self, first, last, email, dob, role='student'):
        self.name = first + ' ' + last
```
or even
```python
class Person(object):
    @autoassign(exclude=('first', 'last'))
    def __init__(self, first, last, email, dob, role='student'):
        self.name = first + ' ' + last
```
Simply using `@autoassign` (with no arguments) stores all constructor arguments as attributes.
adding bound methods to instances
passing methods to higher-order functions
Jonathan Elsas points out that, because object methods take `self` as their first parameter, an expression like `' abc '.strip()` can be rewritten as `str.strip(' abc ')`. Why is this useful? Because often it is desirable to invoke instance methods indirectly via higher-order functions, decoupling the method name from the call. For example:
```python
import operator
ops = [str.strip, str.split, operator.itemgetter(0), str.lower]
for op in ops:
    data = map(op, data)
```
Note the use of the `operator` module, which provides convenience functions, suitable for passing to higher-order functions, for operations that would normally be encoded with special syntax (operators).
super(), multiple inheritance, and method resolution order
Execution & I/O
`with` statement is now the preferred way to open files

```python
with open('file.txt') as f:
    dostuff(f)
```

The file will automatically be closed at the end of the block, even if an exception is encountered.
built-in I/O, execution environment, and reflection functions
- `print([object, ...][, sep=' '][, end='\n'][, file=sys.stdout])`
- `input([prompt])`: expects a valid Python expression as input
- `raw_input([prompt])`
- `open(filename[, mode[, bufsize]])`: creates a `file` object
- `help([obj])`
- `dir([obj])`: returns a list of names in `obj`, or in the current scope if no argument is provided
- `globals()`
- `locals()`
- `vars([obj])`
- `__import__(...)`
- `reload(module)`
- `eval(expression[, globals[, locals]])`
- `execfile(filename[, globals[, locals]])`
- `compile(source, ...)`
file and stream handling libraries
special files
In addition to the built-in `open()`, there are ways to open special kinds of files. In Python 2.x, `codecs.open()` is for encoded text files:

```python
import codecs
with codecs.open('file.txt', 'r', 'utf-8') as f:
    for ln in f:
        process(ln)  # 'ln' is automatically decoded into Unicode from UTF-8
```
`gzip.open()` is for gzipped files. The `tempfile` module can generate temporary files/directories.
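For instance (sketches: `process()` and the file names are hypothetical; `gzip.open()` supports the `with` statement as of Python 2.7):

```python
import gzip, tempfile

# iterate over the lines of a gzipped text file
with gzip.open('data.txt.gz', 'rb') as f:
    for ln in f:
        process(ln)

# create a scratch file that is deleted automatically when closed
with tempfile.NamedTemporaryFile(suffix='.txt') as tmp:
    tmp.write('scratch data\n')
    tmp.flush()
    print(tmp.name)  # path to the temporary file
```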
filesystem access
`os` provides basic filesystem support. Of particular note:

- `os.chdir(path)` changes the current working directory
- `os.getcwd()` returns the path to the current working directory
- `os.listdir(path)` returns a list of names of items in the provided directory
- `os.mkdir(path[, mode])` creates a directory
- `os.walk(top[, topdown=True[, onerror=None[, followlinks=False]]])` recursively lists directories and files in the provided directory and subdirectories (see the sketch below)
- The `os.path` submodule handles path manipulations, notably: `abspath()`, `basename()`, `dirname()`, `exists()`, `isfile()`, `isdir()`, `join()`, and `split()`

`shutil` offers additional support for copying and (re)moving files. Unix-style pattern matching is supported for names of files (`fnmatch`) and paths (`glob`).
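As a quick sketch, printing the path of every .py file under the current directory:

```python
import os

for root, dirs, files in os.walk('.'):
    for name in files:
        if name.endswith('.py'):
            print(os.path.join(root, name))
```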
`fileinput` [suggested by Jonathan Elsas]: iterate through lines of input from files (typically specified as script arguments) or standard input.
```python
import fileinput, glob, sys
# reads from the files named by all but the first 2 arguments, as well as all .txt files
# (fileinput does not expand wildcards itself, hence glob)
for ln in fileinput.input(sys.argv[3:] + glob.glob('*.txt')):
    print(fileinput.filename())
    process(ln)
```
```python
# reads from stdin
for ln in fileinput.input([]):
    process(ln)
```
```python
# defaults to sys.argv[1:], or sys.stdin if no arguments
for ln in fileinput.input():
    process(ln)
```
```python
# read input encoded as UTF-8, whether in sys.stdin or a file specified as an argument
import codecs, sys
sys.stdin = codecs.getreader('utf-8')(sys.stdin)
for ln in fileinput.input(openhook=fileinput.hook_encoded('utf-8')):
    process(ln)
```
Cf. `io`.
INI-style configuration files
The `ConfigParser` module provides functionality for working with configuration files of the format traditionally associated with the .ini extension. This format groups option assignments under one or more section headers (in brackets). Option values are then accessed by section. Comments begin with a semicolon. Default options (implicitly part of a `DEFAULT` section) can be provided to the constructor. Option names are case-insensitive, though section names are case-sensitive. The `SafeConfigParser` class, which supports interpolation (variable substitution), is recommended. (Interpolation does not cross sections, though defaults are available from all sections.)
Supposing the contents of opts.ini are as follows:
```ini
[Input]
dir = ./in
file = %(dir)s/in.txt ; The interpolation syntax is %(...)s

[Output]
file = %(dir)s/out.txt ; Uses default value of 'dir'

[Hyperparams]
; These values can be tuned
alpha = 0.5
k = 30
optimize = no
evaluate = FALSE
```
This file can be read as follows:
```python
import ConfigParser
config = ConfigParser.SafeConfigParser({'dir': '.'})
config.read('opts.ini')
config.sections() == ['Input', 'Output', 'Hyperparams']
config.get('Input', 'dir') == './in'
config.get('Input', 'file') == './in/in.txt'
config.get('Output', 'dir') == config.get('DEFAULT', 'dir') == '.'
config.get('Output', 'file') == './out.txt'
sorted(config.options('Hyperparams')) == ['alpha', 'dir', 'evaluate', 'k', 'optimize']
dict(config.items('Hyperparams')) == {'dir': '.', 'alpha': '0.5', 'k': '30', 'optimize': 'no', 'evaluate': 'FALSE'}
dict(config.items('Hyperparams', vars={'k': '100'}))['k'] == '100'  # with override
config.getint('Hyperparams', 'k') == 30
config.getfloat('Hyperparams', 'alpha') == 0.5
config.getboolean('Hyperparams', 'optimize') == config.getboolean('Hyperparams', 'evaluate') == False
```
Classes in the `ConfigParser` module can also be used to create configuration files.
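For example, a minimal sketch of programmatically writing a file like the one above:

```python
import ConfigParser

config = ConfigParser.SafeConfigParser()
config.add_section('Hyperparams')
config.set('Hyperparams', 'alpha', '0.5')
config.set('Hyperparams', 'k', '30')
with open('opts.ini', 'w') as f:
    config.write(f)
```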
PHP provides similar (but more limited) functionality in `parse_ini_file()`: interpolation and default options are not supported. The PHP implementation additionally supports array-valued options.
executing other processes: subprocess
The `subprocess` module is used to spawn new processes (e.g. system commands) and pipe to/from them. It is quite complex; see this tutorial for the important aspects.
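A small sketch of two common patterns (the commands are arbitrary examples; `check_output()` was added in Python 2.7):

```python
import subprocess

# run a command, wait for it, and capture its standard output
listing = subprocess.check_output(['ls', '-l'])

# pipe data through a command
p = subprocess.Popen(['grep', 'py'],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, err = p.communicate(listing)
```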
runtime CPU profiling: cProfile
See the Instant User's Manual. (Recommended by Jonathan Elsas.)
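Minimal usage, assuming a hypothetical `main()` entry point:

```python
import cProfile
cProfile.run('main()', sort='cumulative')  # profile the call; print stats sorted by cumulative time
# from the shell, an entire script can be profiled with:  python -m cProfile myscript.py
```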
catching Ctrl+C
```python
try:
    pass  # stuff...
except KeyboardInterrupt:
    pass  # Ctrl+C handling
```
interactive post-execution for debugging
From Ned Batchelder's blog:
The -i switch to the Python interpreter will run your program, then leave you in the command prompt when it ends. This can be good for interactively testing the functions defined in the file, or for debugging if an exception happens.
If you use "python -i" and need to debug, pdb.pm() is the thing to use. It places you at the point where the last traceback was raised. It seems like magic, since the exception has already risen to the top level of the program, but pdb.pm() puts you "back in time" before the exception started its climb up through the stack.
Source files
file headers
My Python source files start something like this:
```python
#coding=UTF-8
'''
Utilities for scoring/evaluating data (human annotations or system output).

Plotting functions require the 'pylab' library (includes numpy and matplotlib).

@author: Nathan Schneider (nschneid)
@since: 2010-08-25
'''

# Strive towards Python 3 compatibility
from __future__ import print_function, division, absolute_import
from future_builtins import map, filter, zip
```
The first line indicates to the interpreter that there may be Unicode characters in the source file (e.g. in comments). Next is the docstring describing the module. Finally, when writing 2.7 code, there are several imports to enable Python 3 behavior: `print()` as a function rather than a statement; real division (never integer division) with the `/` operator; etc. This will make it easier to port the code without bugs.
Additionally, for text processing scripts I typically use the following:
```python
import os, sys, re, codecs, fileinput
from collections import defaultdict, Counter
```
Specialty modules
general
- `threecheck`: a package (included in Python 3.x) which can perform runtime type checking on function arguments/return values
- `argparse`: library in Python 2.7 and 3.2 for managing command-line arguments
- `doctest`: built-in library that makes unit testing dead simple: it allows test cases to be embedded within docstrings (see the sketch after this list)
- `pymysql`: interface to MySQL
- NLTK: toolkit for natural language processing
- `boto`: library for working with the Amazon Web Services API (including Mechanical Turk) from Python
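A minimal `doctest` sketch (with a hypothetical `square()` function):

```python
def square(x):
    '''
    >>> square(3)
    9
    >>> square(-2)
    4
    '''
    return x * x

if __name__ == '__main__':
    import doctest
    doctest.testmod()  # runs the examples embedded in the docstrings
```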
mathy stuff
- built-in math functions
- `math` module
- `scipy`, `numpy`: packages for numerical computing
- `scikits.learn`: machine learning library (looks promising, under active development)
- `pygsl`: interface to the GNU Scientific Library (see the GSL reference)
- `matplotlib`: for plotting
- `rpy2`: Python interface to R
- `uncertainties`: computation for numeric values associated with an error range and their derivatives
- `networkx`: API for working with graph structures
times and dates
Python's support for date/time/calendar functionality is horribly confusing, and distributed across several modules with varied APIs—only some of which are in the standard library. See datetime module overview (with links at the bottom to related modules). The inscrutability of these APIs is well known (see "Pythonic Dates, Times, and Deltas" thread on python-ideas; cf. "Dealing with Timezones in Python").
I hacked together temporal.py as a means of providing a consistent interface for some of this functionality (esp. converting among string, integer, and special API representations of dates and times). Others have similarly written alternative APIs: [1], [2]
Development tools
The Pydev plugin for Eclipse is fantastic. It includes some fairly sophisticated static code analysis to support code completion and the like.
`ipython` is an interactive Python shell with a number of useful features: support for some basic filesystem operations, tab completion of object attributes, and automatic saving of command history across sessions. It also incorporates an interface for parallel computing.
General resources
- On python.org: standard library (2.x, 3.x), standard library highlights (2.x, 3.x), language reference (2.x, 3.x), and an extensive tutorial (2.x, 3.x)
- Dive Into Python tutorial
- Python Module of the Week
- Hitchhiker's Guide to Packaging
- Planet Python
- Python Enhancement Proposals (PEPs)
- python-ideas list
- the BDFL
- Monty Python
installation/configuration for Mac OS 10.6 (Snow Leopard)
I am by no means an expert on Python distribution methods or compiling code on OS X, but after several frustrating encounters trying to install new Python modules here is my current understanding of the issues:
- There are multiple Python distributions for Mac: official python.org releases, and the built-in ones released by Apple, which tend to lag behind the official releases. These are installed to different paths. I use official python.org releases, as can be seen with the `which` command in the shell:

```sh
$ which python2.7
/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7
```
- Compiled code (e.g. in C or C++) is specific to a chipset/architecture. Architectures relevant to Macs are:
  - ppc (PowerPC; these are for old systems)
  - i386 (32-bit Intel)
  - x86_64 (64-bit Intel; also known as amd64)
- Snow Leopard supports both 32-bit and 64-bit Intel architectures for software. "Universal binaries" include multiple modes. You have to choose one or more architectures when you download an official python.org release.
- To check the architecture of a binary, use the `file` command:

```sh
$ file /Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7
/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7: Mach-O universal binary with 2 architectures
/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 (for architecture ppc): Mach-O executable ppc
/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 (for architecture i386): Mach-O executable i386
```

- As I have an Intel machine, this means all Python libraries must support the 32-bit i386 architecture. (Installing the x86_64/i386 version of Python might make things easier; I suspect that installer had not yet been released when I set up Python 2.7 on my system.)
- Python packages come in multiple flavors. Pure-Python packages are written entirely in Python, and should be a piece of cake to install (e.g. with easy_install or pip). Other packages, such as numpy, PIL, PyQt, and the like depend on compiled code written in languages like C, C++, even FORTRAN! Compilers for these languages are released by Apple as part of the XCode toolkit, which you will need to install in order to compile said code.
- XCode compilers on Snow Leopard default to 64-bit mode (that is, the C/C++/etc. compilers produce 64-bit binaries by default). This is a problem if, like me, you have a Python installation that is not 64-bit compatible! Here are some tricks to employ before attempting to install any package that includes compiled code:

```sh
$ export MACOSX_DEPLOYMENT_TARGET=10.6
$ export ARCHFLAGS="-arch i386"
$ export CFLAGS="-m32"
$ export CXXFLAGS="-m32"
```
- If you get a "mach-o" error indicating an architecture mismatch, you may need to download and compile the sources. For example, in order to install the PyQt and pyzmq dependencies for the Qt console in IPython 0.11, I used the Qt Cocoa binary installer (32-bit and 64-bit), then installed the SIP, PyQt4, and ZeroMQ sources (following the SIP and PyQt instructions here but with i386 instead of x86_64), and then used

```sh
$ easy_install pyzmq
```
Cf. Installing Python 2.7, `matplotlib`, and `ipython` on Mac OS X Snow Leopard.
Multiple friends have recommended the Homebrew package manager. I have not tried it yet and thus am not sure how it will handle the architecture issues.
general programming resources
- Call Me Crazy: Calling Conventions (overview of subroutine concepts and terminology)
Thoughts
Why I use Python (in case you care)
Today, programmers have a wide menu of options when it comes to programming languages. This is a good thing, as people have different needs and different tastes. I do not presume that any particular language is best for all people or all purposes.
That being said, I find Python uniquely fun and satisfying. It essentially boils down to two reasons: usability and community.
Usability
I like to think of Python as pseudocode that works. Ideally, code should effortlessly reflect algorithmic ideas; the language and development tools should work to support the developer, not the other way around.
There are several parts to usability:
- Language expressivity and transparency
- I refer here to the ease of mapping between the conceptual and the formal when it comes to describing an algorithm or system. This includes learnability (effort required to master the basics of the language), writeability (effort required to transform ideas into code once you know the language), and readability (effort to reconstruct ideas from code).
- Domain flexibility
- Some languages target a particular type of application. PHP is specialized for Web scripting; Perl is specialized for text processing; Matlab and R are specialized to mathematical/statistical computation. Python, in contrast, seeks to support important general-purpose programming styles and idioms in the language, and to provide specialized functionality in libraries (such as Django, NLTK, numpy/scipy/matplotlib).
- Performance and compatibility
- As a dynamic, interpreted language, cross-platform compatibility is rarely an issue with standard Python implementations. Performance is perhaps the greatest challenge Python must face if it is to become a truly general-purpose language. There is a lot of community excitement around performance-oriented implementations such as PyPy ([1], [2], [3]). Efforts including Jython, Cython, `ctypes`, and `rpy2` provide cross-language interoperability, whether for performance or other reasons.
- Development support
- Tools/resources like editors, interpreters, debuggers, and documentation make the development process less frustrating and more efficient. Python's interactive interpreter and introspection capabilities, I think, are enormous assets when it comes to understanding how code works.
- Low overhead
- The above two strengths make for a low barrier to entry for Python novices and a better day-to-day experience for experts. Starting a new project, understanding and tweaking existing projects, and making a piece of software robust (documentation, cross-platform compatibility, etc.) are all important aspects of this experience.
Community
The size, diversity, and enthusiasm of a programming language's user base is important. These characteristics dictate how much pressure there is to keep the language (and its libraries and resources) polished and relevant. Statistics attest to Python's large and growing user base ([1], [2], [3]). It is an open-source effort with support from individual volunteers as well as industry (e.g. Google). It is popular for many purposes, ranging from scientific computing to text processing to client and Web applications to education.
Usability and community feed each other: a more usable technology translates to less of a barrier to entry and greater user satisfaction, which makes for a larger community. A larger community means more diversity of the user base, and more energy towards making the technology more useful to more people.
History
- 2012-04-08
- [2.06] added sequence comparison and BetweenDict, included type() in built-in object operations, improved file headers and fileinput examples, fixed diveintopython link and bug in @xkwargs decorator
- 2011-08-07
- [2.05] elaborated the issues with installing/configuring Python for Snow Leopard; linked to Dive Into Python and "Dealing with Timezones in Python"
- 2011-06-25
- [2.04] added INI-style configuration files
- 2011-06-01
- [2.03] added interactive post-execution for debugging
- 2011-05-29
- [2.02] added iteration patterns `for` everyone, sequence literal repetition operator
- 2011-05-26
- [2.01] added `super()`/multiple inheritance, passing methods to higher-order functions, runtime CPU profiling: `cProfile`
- 2011-05-23
- [2.0] new section: Execution & I/O, including a new subsection on file and stream handling; updated mathematical links, including new link to `scikits.learn`; added Jonathan Elsas's comment about generator expressions
- [1.01] added variable scoping
- 2011-05-14
- [1.0]