Brian Zhang's blog

Statistics and other topics

Distributions with SymPy

Posted Apr 4, 2018 · 8 min read

Any good statistics student will need to do some integrals in their life. While I generally feel comfortable with simple integrals, I thought it might be worth setting up a workflow to help automate this process!

Previously, especially coming from a physics background, I’ve worked a lot with Mathematica, an advanced version of the software available online as WolframAlpha. Mathematica is extremely powerful, but it’s not open-source and comes with a hefty license fee, so I decided to research alternatives.

The main options I looked into were Sage, Maxima, and SymPy, and I eventually decided to take SymPy for a spin.1 This will also be my first post in Python, which is well-supported by the knitr / R Markdown framework.

Expectations as integrals

Given a PDF \(f(x)\) of a continuous random variable \(X\), expectations of functions of \(X\) take the form of integrals. Concretely, let \(g(X)\) be a function of the random variable. Then \[ \mathbb{E}_X[g(X)] = \int_{-\infty}^{\infty}g(x)f(x)dx \] (If \(X\) is a multivariate random variable, the integral should be appropriately converted to multiple dimensions.)

The normalization condition of a PDF can be written as: \[ \mathbb{E}_X[1] = \int_{-\infty}^{\infty}f(x)dx = 1 \] Moments of \(X\) take the form: \[ \mathbb{E}_X[X^n] = \int_{-\infty}^{\infty}x^nf(x)dx \] from which one can get the mean, \(\mathbb{E}_X[X]\), and the variance, \[ Var(X) = \mathbb{E}_X[X^2] - (\mathbb{E}_X[X])^2 \] One final useful expectation is the moment generating function, or MGF. For a real variable \(t\), the MGF is the function \[ M_X(t) = \mathbb{E}_X[e^{tX}] = \int_{-\infty}^{\infty}e^{tx}f(x)dx \] defined for \(t\) in a neighborhood of 0 where this expectation exists.
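The MGF earns its name: differentiating under the integral sign shows that \[ \frac{d^n}{dt^n} M_X(t) \Big|_{t=0} = \mathbb{E}_X[X^n] \] so the moments (and hence the mean and variance) can also be read off from derivatives of \(M_X\) at 0. We'll use this below to cross-check some of SymPy's answers.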

In this post, we’ll use SymPy to try to compute these quantities analytically for a few distributions. The type of software exemplified by SymPy and Mathematica is called a computer algebra system, and uses coded rules to manipulate expressions.

Setup

First we import SymPy:

import sympy as sym
print(sym.__version__)
## 1.1.1

To write SymPy expressions, one first defines the symbols that are manipulated. We start out with \(x\), the variable with respect to which PDFs are defined, and \(t\), the variable for MGFs. We then define some simple helper functions for expressing our expectations of interest.

sym.init_printing()
x, t = sym.symbols('x t', real=True)
def area(dist):
    return sym.simplify(sym.integrate(dist, (x, -sym.oo, sym.oo)))
def mean(dist):
    return area(dist*x)
def EX2(dist):
    return area(dist*x**2)
def variance(dist):
    return sym.simplify(EX2(dist) - mean(dist)**2)
def mgf(dist):
    return sym.simplify(area(dist*sym.exp(x*t)))
def latex(result):
    return "$" + sym.latex(result) + "$\n" 
def summarize(dist):
    print("Distribution: " + latex(dist))
    print("Area: " + latex(area(dist)))
    print("Mean: " + latex(mean(dist)))
    print("Variance: " + latex(variance(dist)))
    print("MGF: " + latex(mgf(dist)))
summarise = summarize  # alias

Our summarize (or summarise) function allows us to print the relevant summary information given a “distribution”, which is just a SymPy function of x.

Next, we define the other symbols that will be used throughout this post.2

# Define other symbols that show up
mu = sym.symbols('mu', real=True)
sigma, a, b, lamb, nu = sym.symbols('sigma a b lambda nu', positive=True)
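Before moving on, it's worth sanity-checking the helpers on a distribution whose answers we already know. Here is a minimal, self-contained check (repeating the relevant definitions from above) using the standard normal:

```python
import sympy as sym

x = sym.symbols('x', real=True)

def area(dist):
    return sym.simplify(sym.integrate(dist, (x, -sym.oo, sym.oo)))

def mean(dist):
    return area(dist * x)

# Standard normal N(0, 1): the area should be 1 and the mean 0
std_normal = sym.exp(-x**2 / 2) / sym.sqrt(2 * sym.pi)
print(area(std_normal))  # 1
print(mean(std_normal))  # 0
```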

Distributions

Normal distribution: \(\mathcal{N}(x; \mu, \sigma^2)\)

We start with the normal distribution:3

normal = (2*sym.pi*sigma**2) ** sym.Rational(-1, 2) * sym.exp(-(x-mu)**2/(2*sigma**2))
summarize(normal)

Distribution: \(\frac{\sqrt{2}}{2 \sqrt{\pi} \sigma} e^{- \frac{\left(- \mu + x\right)^{2}}{2 \sigma^{2}}}\)

Area: \(1\)

Mean: \(\mu\)

Variance: \(\sigma^{2}\)

MGF: \(e^{\frac{t}{2} \left(2 \mu + \sigma^{2} t\right)}\)

All four quantities are correct! (See Wikipedia.)4
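We can also cross-check the MGF result: its derivatives at \(t = 0\) should recover the moments. A quick check, starting from the MGF SymPy found above:

```python
import sympy as sym

t, mu = sym.symbols('t mu', real=True)
sigma = sym.symbols('sigma', positive=True)

# The normal MGF found above: exp(t/2 * (2*mu + sigma**2*t))
M = sym.exp(mu*t + sigma**2 * t**2 / 2)

m1 = M.diff(t).subs(t, 0)        # E[X]
m2 = M.diff(t, 2).subs(t, 0)     # E[X^2]
print(m1)                        # mu
print(sym.simplify(m2 - m1**2))  # sigma**2
```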

Laplace distribution: \(DoubleExp(x; \mu, b)\)

laplace = (2*b) ** (-1) * sym.exp(-sym.Abs(x-mu)/b)
summarize(laplace)

Distribution: \(\frac{1}{2 b} e^{- \frac{1}{b} \left|{\mu - x}\right|}\)

Area: \(1\)

Mean: \(\mu\)

Variance: \(2 b^{2}\)

MGF: \(\begin{cases} - \frac{e^{\mu t}}{b^{2} t^{2} - 1} & \text{for}\: \operatorname{periodic_{argument}}{\left (e^{\frac{i \pi}{2}} \operatorname{polar\_lift}{\left (t \right )},\infty \right )} = 0 \\\int_{-\infty}^{\infty} \frac{1}{2 b} e^{\frac{1}{b} \left(b t x - \left|{\mu - x}\right|\right)}\, dx & \text{otherwise} \end{cases}\)

I have no idea what the intimidating condition is, but the first branch is the correct MGF (valid for \(|t| < 1/b\)).
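As a spot check on that first branch, we can evaluate the defining integral directly at concrete values (here \(\mu = 0\), \(b = 1\), \(t = 1/2\)), splitting the absolute value by hand so SymPy gets two easy exponential integrals:

```python
import sympy as sym

x = sym.symbols('x', real=True)

# Laplace(0, 1) MGF at t = 1/2, with |x| split into its two branches
left = sym.integrate(sym.exp(x/2) * sym.exp(x) / 2, (x, -sym.oo, 0))
right = sym.integrate(sym.exp(x/2) * sym.exp(-x) / 2, (x, 0, sym.oo))
print(left + right)  # 4/3, matching -exp(mu*t)/(b**2*t**2 - 1) at t = 1/2
```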

Exponential distribution: \(Exp(x; \lambda)\)

This function is defined piecewise:

expo = sym.Piecewise(
    (0, x < 0),
    (lamb * sym.exp(-lamb*x), True)
)
summarize(expo)

Distribution: \(\begin{cases} 0 & \text{for}\: x < 0 \\\lambda e^{- \lambda x} & \text{otherwise} \end{cases}\)

Area: \(1\)

Mean: \(\frac{1}{\lambda}\)

Variance: \(\frac{1}{\lambda^{2}}\)

MGF: \(\begin{cases} \frac{\lambda}{\lambda - t} & \text{for}\: \left|{\operatorname{periodic_{argument}}{\left (e^{i \pi} \operatorname{polar\_lift}{\left (t \right )},\infty \right )}}\right| \leq \frac{\pi}{2} \\\int_{-\infty}^{\infty} \begin{cases} 0 & \text{for}\: x < 0 \\\lambda e^{- x \left(\lambda - t\right)} & \text{otherwise} \end{cases}\, dx & \text{otherwise} \end{cases}\)
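The first branch, \(\lambda/(\lambda - t)\), is the familiar exponential MGF, valid for \(t < \lambda\). Differentiating it at \(t = 0\) recovers the mean and variance found above:

```python
import sympy as sym

t = sym.symbols('t', real=True)
lamb = sym.symbols('lambda', positive=True)

M = lamb / (lamb - t)             # exponential MGF, valid for t < lambda
m1 = M.diff(t).subs(t, 0)         # E[X]
m2 = M.diff(t, 2).subs(t, 0)      # E[X^2]
print(sym.simplify(m1))           # 1/lambda
print(sym.simplify(m2 - m1**2))   # lambda**(-2)
```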

Gamma distribution: \(Gamma(x; a, b)\)

gamma = sym.Piecewise(
    (0, x < 0),
    (b**a / sym.gamma(a) * x**(a-1) * sym.exp(-x*b), True)
)
summarize(gamma)

Distribution: \(\begin{cases} 0 & \text{for}\: x < 0 \\\frac{b^{a} x^{a - 1}}{\Gamma{\left(a \right)}} e^{- b x} & \text{otherwise} \end{cases}\)

Area: \(1\)

Mean: \(\frac{a}{b}\)

Variance: \(\frac{a}{b^{2}}\)

MGF: \(\begin{cases} \begin{cases} b^{a} t^{- a} \left(\frac{1}{t} \left(b - t\right)\right)^{- a} & \text{for}\: \frac{b}{\left|{t}\right|} > 1 \\b^{a} t^{- a} \left(\frac{1}{t} \left(- b + t\right)\right)^{- a} e^{- i \pi a} & \text{otherwise} \end{cases} & \text{for}\: \left|{\operatorname{periodic_{argument}}{\left (e^{i \pi} \operatorname{polar\_lift}{\left (t \right )},\infty \right )}}\right| \leq \frac{\pi}{2} \\\int_{-\infty}^{\infty} \begin{cases} 0 & \text{for}\: x < 0 \\\frac{b^{a} x^{a - 1}}{\Gamma{\left(a \right)}} e^{x \left(- b + t\right)} & \text{otherwise} \end{cases}\, dx & \text{otherwise} \end{cases}\)

Fun fact: Wikipedia tells us that the Exponential, Chi-squared, and Erlang distributions are all special cases of the Gamma.

Beta distribution: \(Beta(x; a, b)\)

The Beta distribution is the first one that SymPy was unable to evaluate. When I tried the area, mean, variance, and MGF, all the integrals hung, and I had to abort the operation.5

beta = sym.Piecewise(
    (0, x < 0),
    (0, x > 1),
    (x**(a-1)*(1-x)**(b-1)/(sym.gamma(a)*sym.gamma(b)/sym.gamma(a+b)), True)
)
print("Distribution: " + latex(beta))
# area(beta)  # had to abort

Distribution: \(\begin{cases} 0 & \text{for}\: x > 1 \vee x < 0 \\\frac{x^{a - 1} \Gamma{\left(a + b \right)}}{\Gamma{\left(a \right)} \Gamma{\left(b \right)}} \left(- x + 1\right)^{b - 1} & \text{otherwise} \end{cases}\)
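One workaround, at least for the moments, is SymPy's sympy.stats module, which has built-in knowledge of named distributions. A sketch with concrete shape parameters (the fully symbolic case may still be slow):

```python
from sympy import Rational, simplify
from sympy.stats import Beta, E, variance

# Beta(2, 3): mean a/(a+b) = 2/5, variance a*b/((a+b)**2*(a+b+1)) = 1/25
X = Beta('X', 2, 3)
print(simplify(E(X)))         # 2/5
print(simplify(variance(X)))  # 1/25
```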

Uniform distribution

However, the Uniform distribution, a special case of the Beta with \(a = b = 1\), works just fine:

uniform = sym.Piecewise(
    (0, x < 0),
    (0, x > 1),
    (1, True)
)
summarize(uniform)

Distribution: \(\begin{cases} 0 & \text{for}\: x > 1 \vee x < 0 \\1 & \text{otherwise} \end{cases}\)

Area: \(1\)

Mean: \(\frac{1}{2}\)

Variance: \(\frac{1}{12}\)

MGF: \(\begin{cases} 1 & \text{for}\: t = 0 \\\frac{1}{t} \left(e^{t} - 1\right) & \text{otherwise} \end{cases}\)
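The piecewise split at \(t = 0\) is just the removable singularity of \((e^t - 1)/t\). Taking limits of its derivatives at 0 recovers the uniform's moments, matching the mean and variance above:

```python
import sympy as sym

t = sym.symbols('t', real=True)
M = (sym.exp(t) - 1) / t             # uniform MGF, away from t = 0

m1 = sym.limit(M.diff(t), t, 0)      # E[X] = 1/2
m2 = sym.limit(M.diff(t, 2), t, 0)   # E[X^2] = 1/3
print(m1, sym.simplify(m2 - m1**2))  # 1/2 1/12
```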

Student t distribution

Last we come to the Student t distribution. This one doesn’t have an MGF (see Wikipedia), so we display each quantity of interest separately rather than use our summarize function.

student = (1 + ((x-mu) / sigma)**2 / nu)**(-(1+nu)/2) * sym.gamma((nu+1)/2)/(sym.gamma(nu/2)*sym.sqrt(nu*sym.pi)*sigma)
print("Distribution: " + latex(student))

Distribution: \(\frac{\left(1 + \frac{\left(- \mu + x\right)^{2}}{\nu \sigma^{2}}\right)^{- \frac{\nu}{2} - \frac{1}{2}}}{\sqrt{\pi} \sqrt{\nu} \sigma \Gamma{\left(\frac{\nu}{2} \right)}} \Gamma{\left(\frac{\nu}{2} + \frac{1}{2} \right)}\)

print("Area: " + latex(area(student)))

Area: \(1\)

print("Mean: " + latex(mean(student)))

Mean: \(\begin{cases} \mu & \text{for}\: \frac{\nu}{2} + \frac{1}{2} > 1 \\\int_{-\infty}^{\infty} \frac{\nu^{\frac{\nu}{2} + 1} \sigma^{\nu} x \Gamma{\left(\frac{\nu}{2} + \frac{1}{2} \right)}}{2 \sqrt{\pi} \Gamma{\left(\frac{\nu}{2} + 1 \right)}} \left(\mu^{2} - 2 \mu x + \nu \sigma^{2} + x^{2}\right)^{- \frac{\nu}{2} - \frac{1}{2}}\, dx & \text{otherwise} \end{cases}\)

print("Variance: " + latex(sym.trigsimp(variance(student))))

Variance: \(- \begin{cases} \mu^{2} & \text{for}\: \frac{\nu}{2} + \frac{1}{2} > 1 \\\left(\int_{-\infty}^{\infty} \frac{\nu^{\frac{\nu}{2} + 1} \sigma^{\nu} x \Gamma{\left(\frac{\nu}{2} + \frac{1}{2} \right)}}{2 \sqrt{\pi} \Gamma{\left(\frac{\nu}{2} + 1 \right)}} \left(\mu^{2} - 2 \mu x + \nu \sigma^{2} + x^{2}\right)^{- \frac{\nu}{2} - \frac{1}{2}}\, dx\right)^{2} & \text{otherwise} \end{cases} + \begin{cases} \frac{2 \left(\mu^{2} \nu - 2 \mu^{2} + \nu \sigma^{2}\right) \cos^{2}{\left (\frac{\pi \nu}{2} \right )}}{\left(\nu - 2\right) \left(\cos{\left (\pi \nu \right )} + 1\right)} & \text{for}\: - \frac{\nu}{2} + \frac{5}{2} < 2 \wedge - \frac{\nu}{2} + 3 < 2 \\\int_{-\infty}^{\infty} \frac{\nu^{\frac{\nu}{2} + 1} \sigma^{\nu} x^{2} \Gamma{\left(\frac{\nu}{2} + \frac{1}{2} \right)}}{2 \sqrt{\pi} \Gamma{\left(\frac{\nu}{2} + 1 \right)}} \left(\mu^{2} - 2 \mu x + \nu \sigma^{2} + x^{2}\right)^{- \frac{\nu}{2} - \frac{1}{2}}\, dx & \text{otherwise} \end{cases}\)

Here, I used sym.trigsimp, which added a few simplifications compared to sym.simplify (you can check this yourself). Even so, SymPy doesn’t quite reach the simplified expression for the variance. Notice that \[ 2\cos^2(y/2) = \cos(y) + 1 \] so the expressions with \(y = \pi \nu\) should cancel. If we spot this by eye, we can then use SymPy to finish the job; the result is valid for \(\nu > 2\).6
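SymPy can at least confirm the half-angle identity itself:

```python
import sympy as sym

y = sym.symbols('y', real=True)
# 2*cos(y/2)**2 - (cos(y) + 1) should simplify to zero
print(sym.simplify(2*sym.cos(y/2)**2 - sym.cos(y) - 1))  # 0
```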

expression = -mu**2 + (mu**2*nu-2*mu**2+nu*sigma**2)/(nu-2)
print(latex(expression))

\(- \mu^{2} + \frac{1}{\nu - 2} \left(\mu^{2} \nu - 2 \mu^{2} + \nu \sigma^{2}\right)\)

print(latex(sym.simplify(expression)))

\(\frac{\nu \sigma^{2}}{\nu - 2}\)

Conclusion

Out of seven common continuous distributions, SymPy pretty much solved five of them. For the Beta distribution, it had trouble with all the integrals, and for the Student t distribution, it failed to notice some simplifications.

I imagine this is not competitive with Mathematica, but given that it’s free and integrates with Python, I do think SymPy could make a valuable addition to a statistician’s toolbox.

This blog post was generated from an R Markdown file using the knitr and blogdown packages. The original source can be downloaded from GitHub.

UPDATE 2018-04-06: printing the area calculation (PDF normalization) as well.


  1. I found these all from this StackExchange thread.

  2. lamb instead of lambda, because lambda is a predefined Python construct.

  3. See this link for the rationale behind sym.Rational.

  4. I chose to display code output using the knitr option results="asis", so that the LaTeX formatting would show up.

  5. This is also documented by another user as a GitHub issue.

  6. I find it strange that SymPy wasn’t able to simplify the conditions on \(\nu\) to give \(\nu > 1\) for the first part and \(\nu > 2\) for the second.
