
My Opinion: We Can Use Data Science To Recommend Legal Opinions

The Set-Up

Lawyers are people too (shocking, I know). They make mistakes (even if they won’t admit it). And when they’re searching for legal opinions to build arguments, the dominant tools for finding those opinions depend on the lawyer’s ability to use them well.

Keyword searches and drilling down through citations are dependent on the skill of the lawyer doing them.

Even if a lawyer is a Westlaw ninja, they don’t know what they don’t know: they may miss a pertinent but not obvious precedent because it falls outside the bounds of their search parameters.

Enter: the recommender

The Recommender

I built a recommendation system to suggest legal opinions. Give it an opinion and it will give back a ranked list of opinions that are most similar.

“Most similar” means “has the most words and citations in common”.

Let’s get into the nitty-gritty.

The Nitty-Gritty

Data sharing is data caring

I acquired data through the Caselaw Access Project. People not granted access (i.e. us peons not affiliated with giant law firms or research institutions) are limited to 500 search queries a day. Natural language processing techniques work best with large amounts of data, and in the short time available to gather it, 500 queries a day wouldn’t cut it.

HOWEVAH. Illinois and Arkansas, in their wisdom, have released the entirety of their state court opinions at every level - district, appellate and supreme - to the public.

I acquired the entirety of the set of Illinois opinions in a MongoDB database, and extracted those opinions that lawyers would be interested in. (In other words, I filtered out short, insubstantial concurring and dissenting opinions that are essentially ‘I agree / disagree with the majority’.)
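
A rough sketch of what that extraction might look like with pymongo (the database, collection, and field names here are purely illustrative, not my actual schema):

from pymongo import MongoClient

# Hypothetical names throughout: adjust to whatever schema the opinions were loaded under.
client = MongoClient("mongodb://localhost:27017")
opinions = client["illinois_caselaw"]["opinions"]

substantive = []
for doc in opinions.find({}, {"text": 1, "judges": 1, "decision_date": 1, "citation": 1}):
    text = doc.get("text", "")
    # Drop the short "I agree / disagree with the majority" concurrences and dissents.
    if len(text.split()) > 200:
        substantive.append(doc)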

Cleaning the data with holy fire

The data was relatively well-behaved; the Caselaw Access Project does a good job of OCRing legal opinions, and there weren’t weird artifacts in the texts (even from the 1800s).

Out of 182,000 cases, only a couple thousand fell between 1890 and 1920; 1921 was the first year in which the volume of opinions reached the level sustained in modern years. After reading a sample of the 1890-1920 opinions and finding them materially different from modern ones, I excluded them.

I created a dataset with each opinion as an observation, and the judges, years, the opinion’s citation, and the opinion text as variables.
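
In pandas terms, a toy version of that dataset (made-up records, assumed column names) plus the 1921 cutoff described above might look like:

import pandas as pd

# Toy records standing in for the parsed opinions; the field names and citations are made up.
records = [
    {"judges": ["Smith"], "year": 1935, "citation": "360 Ill. 2d 1", "text": "..."},
    {"judges": ["Jones"], "year": 1910, "citation": "245 Ill. 100", "text": "..."},
]
df = pd.DataFrame(records)
df = df[df["year"] >= 1921]  # drop the sparsely-covered 1890-1920 opinions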

After that, I extracted a list of citations for each case using a package called LexPredict. I turned each citation into a dummy variable, so that the dataset now looked like a list of cases with indicator columns for the citations each one contains.
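
The dummy-variable step might look roughly like this (I’m skipping the extraction call itself and assuming we already have one list of citation strings per case; the citations below are toy placeholders):

from sklearn.preprocessing import MultiLabelBinarizer

# Assumed output of the citation-extraction step: a list of citation strings per case.
case_citations = [
    ["245 Ill. 100", "360 Ill. 2d 1"],
    ["360 Ill. 2d 1"],
]
mlb = MultiLabelBinarizer()
citation_dummies = mlb.fit_transform(case_citations)  # one 0/1 column per unique citation
print(mlb.classes_)        # which citation each column corresponds to
print(citation_dummies)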

Then it was time to get into the NLP processing steps.

Tokens and stems and TF-IDF: the weird world of NLP

In order to calculate similarities among opinions, their words had to be broken down into individual units whose commonalities across documents can be tracked. Think of this as feature extraction: getting a list of features for each text. A popular way to do so is by vectorizing the words: essentially creating a dummy variable for each word, so that each word becomes a feature in a document-word matrix.

HOWEVAH. Common grammatical words like ‘of’, ‘the’ and ‘to’ don’t provide any information about document similarity since they are literally in every document; including them is a waste of processing time and would throw off the results. Accordingly, they were excluded from the vectorization process.
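
A minimal sketch of that vectorization step with scikit-learn (the toy texts are obviously stand-ins for real opinions):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["the defendant appealed the ruling",
         "the court affirmed the judgment of the trial court"]

# stop_words="english" drops 'the', 'of', 'to' and friends before counting
count_vec = CountVectorizer(stop_words="english")
doc_word = count_vec.fit_transform(texts)        # sparse documents x words matrix
print(count_vec.get_feature_names_out())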

Another popular vectorization technique replaces each word’s dummy-variable column - a series of 1s and 0s - with a system of weights that reflects how frequent or important a word is to the document. Yes, the infamous TF-IDF technique, savior to wise men and fools alike. (And me as well.)
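
Swapping in TF-IDF weights is a one-line change in the sketch above:

from sklearn.feature_extraction.text import TfidfVectorizer

# Same document-word matrix shape, but raw counts are replaced with TF-IDF weights
# (each row is L2-normalized by default). `texts` comes from the sketch above.
tfidf_vec = TfidfVectorizer(stop_words="english")
doc_word_tfidf = tfidf_vec.fit_transform(texts)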

At this stage, I had two document-word matrices - a TF-IDF version and a dummy-variable version - to run through topic modeling techniques. In addition, I created another matrix of each version with ‘stemmed’ words. Stemming is a process that tries to bucket similar words across grammatical forms, so that words like ‘lay’ and ‘laid’, or ‘saw’ and ‘seen’, are grouped into broader buckets like ‘lie’ and ‘see’ (strictly speaking, collapsing irregular forms like these is lemmatization; a plain stemmer just trims suffixes, but the idea is the same). This can be thought of as a form of feature or dimensionality reduction.
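
One way to fold stemming into the vectorization, sketched here with NLTK’s Porter stemmer wrapped around scikit-learn’s default analyzer (a sketch, not necessarily the exact pipeline I used):

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Wrap the default analyzer so every token is stemmed before counting.
stemmer = PorterStemmer()
base_analyzer = CountVectorizer(stop_words="english").build_analyzer()

def stemmed_analyzer(text):
    return [stemmer.stem(token) for token in base_analyzer(text)]

stem_vec = CountVectorizer(analyzer=stemmed_analyzer)
doc_word_stemmed = stem_vec.fit_transform(texts)  # `texts` from the earlier sketch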

Topic Modeling: or, Viktor Frankl should have searched for meaning with non-negative matrix factorization

The most important part of the process I went through to group together texts was topic modeling. Topic modeling is a way of grouping together words that appear in similar contexts. These groups then become the features of a document-topic matrix, and each document is assigned a weight for each topic. This is yet another form of dimensionality reduction: replacing the tens of thousands of features that are individual words with tens of features that are topics.

What’s the trick?

The trick is making the topics coherent. Frequently, the list of topics will not cohere into anything resembling an intelligible structure, in which case knowing that a document weights one topic more or less heavily than another is meaningless. We’re looking for information like “this opinion weights a ‘police’ topic more heavily than other opinions do”.

A common way of trying to get more coherent topics is to try out multiple topic-generating algorithms. I tried LDA (which uses a Dirichlet distribution to probabilistically assign weights to words within topics), as well as LSA/SVD and NMF (which decompose a document-word matrix into a document-topic matrix and a topic-word matrix).
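
In scikit-learn terms, the candidates look something like this (the tiny corpus and hyperparameters are purely illustrative, not the settings I actually used):

from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["the defendant appealed the ruling",
         "the court affirmed the judgment of the trial court",
         "the officer searched the vehicle without a warrant"]
count_vec = CountVectorizer(stop_words="english")
tfidf_vec = TfidfVectorizer(stop_words="english")
doc_word = count_vec.fit_transform(texts)
doc_word_tfidf = tfidf_vec.fit_transform(texts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)  # Dirichlet-based, probabilistic
lsa = TruncatedSVD(n_components=2, random_state=0)               # SVD, i.e. classic LSA
nmf = NMF(n_components=2, random_state=0)                        # non-negative matrix factorization

doc_topic_lda = lda.fit_transform(doc_word)        # LDA expects raw counts
doc_topic_nmf = nmf.fit_transform(doc_word_tfidf)  # NMF pairs well with TF-IDF weights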

The first few iterations of topic modeling are mainly not about getting the topics coherent, but about finding the words that are too common across the corpus to be useful splits yet aren’t obvious enough to exclude at the step where words like ‘to’, ‘and’ and ‘of’ were dropped. Words in this group included ‘party’, ‘appellate’, judges’ names, and the like. Because they are so common among opinions, when included in the vectorization and topic modeling process they created their own groups that a large number of texts fell into. I subsequently excluded them.

I eventually settled on 20 topics with the NMF technique that were more or less coherent. In addition to eyeballing the top 50 or so words to make sure they cohered, I also looked at the ‘importance’ measure assigned to each word. This is a measure of the strength of each word within the topic. I looked for topics with a more or less even descent down the importance rankings; topics whose top few words were an order of magnitude or more important than the following words were deemed less coherent than topics without that drop-off.
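
Continuing the sketch above, eyeballing a topic’s top words and their weights looks roughly like this (with the real corpus you’d look at the top ~50 words rather than the top 5):

import numpy as np

# `nmf` and `tfidf_vec` come from the earlier sketch; components_ is the topic-word matrix.
words = tfidf_vec.get_feature_names_out()
for topic_idx, topic in enumerate(nmf.components_):
    top = np.argsort(topic)[::-1][:5]
    print("Topic {}: ".format(topic_idx) +
          ", ".join("{} ({:.2f})".format(words[i], topic[i]) for i in top))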

Going the cosine distance

Once the topics were found, it’s a relatively straightforward process to create a recommendation system. The topic modeling process results in a document-topic matrix with documents as rows and their weights in each topic as columns. Applying a similarity metric to this matrix results in a matrix similar to a covariance or correlation matrix, where the relationship of each row to every other row can be read either across a row or down a column.

Cosine similarity is mathematically close to the Pearson correlation; the Pearson correlation is just the cosine similarity with the mean removed from each observation so it’s centered around 0. Geometrically, cosine similarity measures how alike two vectors are by the angle between them as measured from the origin. In other words, the length of the vectors is immaterial. (Cosine distance is simply one minus the similarity.)
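
To put it in symbols: for two document vectors $a$ and $b$, cosine similarity is $\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$, the cosine of the angle between them, and cosine distance is one minus that; the Pearson correlation is the same ratio computed after mean-centering each vector, $\frac{(a - \bar{a}) \cdot (b - \bar{b})}{\lVert a - \bar{a} \rVert \, \lVert b - \bar{b} \rVert}$.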

Applied to a document-topic matrix, cosine similarity counts documents as similar when they are close across multiple topics, no matter how tightly aligned they are on any one particular topic (indeed, two opinions might have the exact same weight for one topic but a low cosine similarity if their other topic weights are dissimilar enough).

This seems like an ideal way to measure similarity among court opinions, each of which deals with multiple issues of fact and abstract legal principles at once.
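
Tying it together, a bare-bones sketch of the ranking step on the document-topic matrix (folding in the citation dummies would work the same way; `doc_topic_nmf` comes from the NMF sketch above):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(doc_topic_nmf)   # square opinions x opinions matrix

def recommend(opinion_idx, top_n=5):
    scores = similarities[opinion_idx].copy()
    scores[opinion_idx] = -1                      # never recommend an opinion to itself
    return np.argsort(scores)[::-1][:top_n]       # indices of the most similar opinions

print(recommend(0, top_n=2))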

Figuring Out Pyplot And Matplotlib Once And For All

Exploring Matplotlib and Pyplot

I’m in the first week at the Metis data science bootcamp. (Feel free to get in touch if you’d like to chat about it.)

It’s great!

And with the amount of time it requires, a quick facility with Matplotlib and pyplot would be a big help.

I . . . don’t have that yet. Figures? The weird syntax? Subplots? Th’ hell?

I’m going to figure it out. Come on the journey with me.

The Basics

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
plt.plot([1,2,3,4]);

First graph

BOOM

But let’s break it down a bit.

  • The %matplotlib inline code is a bit of jupyter notebook magic that displays graphs within the jn without having to call “plt.show()” each time. (This also helps avoid plotting lines from different graphs in the same graphic.)

    Similarly, %config InlineBackend.figure_format = 'svg' is telling jn to use the svg format for graphics (which, when exported, won’t pixelate up on you when displayed in a web browser)

  • Oh uh I’m exporting this post from a jn

  • The semicolon at the end of the plt.plot call suppresses jn from ‘printing’ the return value of the last line of the cell

  • Notice that the main library we’re using is a module within matplotlib called pyplot.

    It’s pyplot (which we imported as plt) that acts as the workhorse graphing tool. IOW: feed data into pyplot to produce a chart.

  • And already there’s a “th’ hell?” moment: we’re only feeding one list of data into pyplot (the y values), and it’s automatically providing x values to pair with the y values, but the x values go from 0:3 while our list was from 1:4 . . . ugh.

    What’s going on is that, if you feed a single sequence of values to pyplot, it treats it as y values and automatically assigns x values to graph it. So far so good, but why do the x-values start at 0? Because, recall, python ranges start at 0.



Ok, this isn’t so bad. Let’s add x-values and start playing around with axes.

plt.plot([1,2,3,4], [1,4,9,16], 'ro');

Second graph

plt.plot([1,2,3,4], [1,4,9,16], 'ro')
plt.xlim( (0,20) )
plt.ylim( (0,20) );

Third graph

Data
  • Feeding two series of data into pyplot acts as you would hope it would: it takes one series as x values and one series as y values

  • But notice that the roles shift relative to the single-series case: the series in the first position is now the x-values, and the second series is the y-values

Colors / Shapes
  • Notice also there’s some weird parameter after the series in plt.plot
    • This is the first of many delightful quirks about matplotlib that are due to its origin.
    • It was designed to mimic the graphing capabilities of Matlab, and originally imported a lot of Matlab’s graphing syntax.
  • So, apparently, Matlab used a string concatenation system to assign colors and shapes to lines
    • The first character in the string assigns the color, and the second character assigns the shape
    • Color options are pretty much what you would expect: ‘b’ for blue, ‘r’ for red, ‘g’ for green, etc.
    • Shape options are delightfully irrational and hard to memorize: ‘s’ for square, ‘o’ for circle, ‘^’ for triangle
Axes
  • Setting axes is more straightforward
    • two axis methods, plt.xlim and plt.ylim, take tuples of (xmin, xmax) and (ymin, ymax) respectively
    • notice how we were able to exaggerate the shape of a curve by manipulating axes, good to know good to know

Multiple Lines

Let’s try plotting multiple series on the same plot

import numpy as np

d = np.arange(0, 10, .5)
plt.plot(d, d, 'bo', d, d**1.5, 'r^', d, np.log(d), 'gs');

Fourth graph

Seems like we’re getting closer to a syntax that might be regularly useful for at least scratchpad work, right?

Notice that pyplot can take multiple line inputs in the form:

plt.plot(line1 x-values, line1 y-values, line1 graphic options, line2 x-values, line2 y-values, line2 graphic options . . . )

Pyplot objects

Let’s use this graph to see what else pyplot is doing

x,y,z = plt.plot(d, d, 'bo', d, d**1.5, 'r^', d, np.log(d), 'gs')

Fifth graph

x
<matplotlib.lines.Line2D at 0x11f6e6668>
y
<matplotlib.lines.Line2D at 0x11f6e6550>
z
<matplotlib.lines.Line2D at 0x11e37ce80>

Looks like in addition to graphing, plt.plot() also returns objects for each line.

These objects are called . . . Lines.

  • Their properties are the properties of the graphed lines
  • These properties are mutable
  • They can be changed (even after the plot is called) with a .setp() method in plt
x,y,z = plt.plot(d, d, 'bo', d, d**1.5, 'r^', d, np.log(d), 'gs')

plt.setp(x, color = 'm');

Sixth graph

So we can change the properties of a line with plt.setp(specific line, specific property to change)

That’s not too bad. What other objects are there? Couldn’t be too -

fig, ax = plt.subplots()

print(fig)
print(ax);
Figure(432x288)
AxesSubplot(0.125,0.125;0.775x0.755)

Seventh graph

oh noooooo

Figures and Subplots

Actually it’s not too bad

These are objects that every graphic in pyplot has. The plt.plot syntax we started with above just hides them.

(If you’re just dashing off a single quick graph for scratchpad exploratory analysis, you probably won’t need them.)

But when making multiple charts (or for public consumption) this syntax seems like it’ll be necessary to use.

Let’s take a look:

def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

plt.figure(1)

plt.subplot(211)
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'g')

plt.subplot(212)
plt.plot(t2, np.cos(2*np.pi*t2), 'r--');

Eighth graph

The bits before the plt calls should be straightforward to interpret:

  • we’re creating arrays from 0 to 5 in steps of .1 and .02
  • and a function which takes t and returns $e^{-t} * \cos(2\pi t)$

plt.figure creates a ‘figure instance’, which is both returnable:

x = plt.figure(1)
x
<matplotlib.figure.Figure at 0x11f3cd278>

and stored in memory as the ‘location’ of the graphic.

  • Unfortunately, this means that if you’re working with multiple figures, they have to be individually closed for their memory to be released (see the quick sketch after this list).
  • The ‘1’ value inside plt.figure(1) is just a name separating figure instances (it can be a string too).
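
A quick sketch of that cleanup, using the plt we imported at the top (the figure names here are just illustrative):

fig1 = plt.figure("scratch-1")
fig2 = plt.figure("scratch-2")

plt.close(fig1)     # release a specific figure's memory
plt.close("all")    # or close every open figure at once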

plt.subplot creates the individual graphics. The 1’s and 2’s inside the calls are a bit of Matlab-inherited syntax, but it’s pretty straightforward:

  • plt.subplot(211) is arranging a plot in a space with 2 rows and 1 column, and takes the first slot (the top row)
  • plt.subplot(212) is arranging a plot in that same 2-row, 1-column space, and takes the second slot (the bottom row)
  • etc.

These can be switched around:

plt.figure(1)

plt.subplot(121)
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'g')

plt.subplot(122)
plt.plot(t2, np.cos(2*np.pi*t2), 'r--');

Ninth graph

Get it?

But now they’re squished together; the axes on the second plot are running into the first graph; dogs and cats are living together, mass hysteria.

We can manipulate the plt.figure call to increase the size at which the plots are displayed (figsize values are in inches, scaled by the figure’s dpi):

plt.figure(2, figsize = (12.2, 8))

plt.subplot(121)
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'g')

plt.subplot(122)
plt.plot(t2, np.cos(2*np.pi*t2), 'r--');

Tenth graph

ahhh, nothing like stretching out

A different syntax using subplot objects

You may have noticed that I mentioned a figure instance being created every time plot is called. So why are we calling plt.figure specifically? Shouldn’t there be a way to store it as an object for easy retrieval? Maybe along with an object for a specific plot?

I got you

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

Eleventh graph

THAT’S RIGHT PEEPS

There’s a method within pyplot called ‘subplots’ which creates a figure object and an array of axes objects in one go.

Those axes objects have their own plotting methods, so we can draw each subplot directly on them.

Can you see the raw power of this syntax?

t1 = np.arange(0,4,.02)
t2 = np.arange(5,20,.005)

arrays = [t1, t2]
linestyles = ['--', '-']
fig, ax = plt.subplots(1, 2, figsize=(10, 4))

for graphic in [0, 1]:
    x = arrays[graphic]
    y = np.cos(2*np.pi*x)
    style = 'r{}'.format(linestyles[graphic])  # red dashed, then red solid
    ax[graphic].plot(x, y, style)

Twelfth graph

dynamic.

graph.

construction.

If only (oh if only) the syntax weren’t plt.subplots() when there’s already a plt.subplot() running around out there

Can’t have everything

Labels, axes values, adding text / graphics

I’ll end my current exploration by adding labels, a title, and some text to the graphs above. The syntax is all pretty straightforward. (Notice you can also dynamically program plot attributes like titles or labels.)

Happy plotting, and stay safe out there

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

fig, ax = plt.subplots(1, 2, figsize = (12, 8))

xlabel = 'Hours after starting a project'
ylabel = 'Normalized {} units'
g_title = 'Calibrated Scale of My {} When Starting a Project'
words = ['work ethic', 'hunger']

ax[0].plot(t1, f(t1), 'bo', t2, f(t2), 'g')
ax[1].plot(t2, np.cos(2*np.pi*t2), 'r--')


ax[0].annotate('local max', xy=(1, .4), xytext=(1.5, .5),
            arrowprops=dict(facecolor='red', shrink=0.05),
            )

for graph in [0,1]:
    ax[graph].set_xlabel(xlabel)
    ax[graph].set_ylabel(ylabel.format(words[graph]))
    ax[graph].set_title(g_title.format(words[graph].title()));

Thirteenth graph