Design, Prototypical Implementation, and Evaluation of an Active Machine Learning Service in the Context of Legal Text Classification
Johannes Muhr, July 10th 2017, Munich
Chair of Software Engineering for Business Information Systems (sebis)
Faculty of Informatics
Technische Universität München
www.matthes.in.tum.de
Outline
1. Motivation
2. Research Questions
3. Research Approach & Objectives
4. Knowledge Base
5. LexML
6. Evaluation
Motivation I
• Processes of legal experts (scientists and lawyers) are
  • time-intensive
  • knowledge-intensive
  • data-intensive
• Legal documents
  • have grown strongly in number in recent years
  • change over time
  • are becoming more complex
→ As a consequence, searching for relevant information is
  • expensive [1]
  • difficult [2]
[1] Roitblat, H. L., et al. [2] Paul, G. L., and J. R. Baron.
Motivation II
"There seems to be little question that machine learning will be applied to the legal sector. It is likely to fundamentally change the legal landscape." [3]
• Legal data science incorporating machine learning is gaining more and more attention, because
  • processing time and memory space are cheap
  • algorithms can process textual data fast and accurately
  • suitable frameworks are available
[3] Liu, B.
Research Questions
1. What are common concepts, strategies, and technologies used for text classification?
2. How can (active) machine learning support the classification of legal documents and their content (norms)?
3. What form does the concept and design of an active machine learning service take?
4. How well does the active machine learning service perform in classifying legal documents and their content (norms)?
Research Approach & Objective
The research follows the design science framework of Hevner et al. (2004):
• Environment: people and organizations working with legal content, and the existing Lexia platform; they supply the business needs.
• IS research: develop and build the LexML service, then assess and refine it; the design is justified and evaluated through document classification and norm classification experiments.
• Knowledge base: foundations and methodologies on (legal) text classification, (active) machine learning, and machine learning frameworks provide the applicable knowledge.
• Results are applied in the appropriate environment and contribute additions to the knowledge base.
Knowledge Base
• Active learning: an iterative and interactive machine learning concept utilizing the strategy that the learning algorithm can select the data from which it learns.
• Learning pipeline: preprocessing → feature vector → learning algorithm (e.g. Naïve Bayes, Logistic Regression).
• Query strategy: an algorithm making use of the predictions of the learning algorithm (classifier) to find the most informative instances.
• Learning cycle: (1) train the classifier on the labeled instances, (2) predict labels for the unlabeled instances, (3) let the query strategy select the most informative unlabeled instances for manual labeling, (4) add the newly labeled instances to the labeled pool and repeat.
• Spark MLlib serves as the ML framework; its ML pipeline is suitable for text classification.
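A minimal sketch of this cycle, using scikit-learn as a compact stand-in for the Spark MLlib pipeline the thesis uses (all function and parameter names here are our own, and the human oracle is simulated by pre-existing labels):

```python
# Illustrative pool-based active learning loop with uncertainty sampling.
# Assumes the seed set covers at least two classes.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def active_learning_loop(texts, oracle_labels, seed_size=10, batch_size=10, rounds=5):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    labeled = list(range(seed_size))               # (1) initially labeled pool
    unlabeled = list(range(seed_size, len(texts)))
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X[labeled], [oracle_labels[i] for i in labeled])
        probs = clf.predict_proba(X[unlabeled])    # (2) predict the unlabeled pool
        # (3) query strategy: pick the instances with the most uncertain prediction
        uncertainty = 1.0 - probs.max(axis=1)
        query = np.argsort(uncertainty)[-batch_size:]
        # (4) the oracle labels the queried instances; move them to the labeled pool
        newly_labeled = [unlabeled[i] for i in query]
        labeled.extend(newly_labeled)
        unlabeled = [i for i in unlabeled if i not in set(newly_labeled)]
    return clf
```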
LexML – Basics
• Independent service using Apache Spark MLlib as its machine learning framework
• Implements the logic for active learning to perform multiclass (legal) text classification
• Accessible via a REST API (see the hypothetical request below)
• Configurable via Lexia:
  • Label
  • Classifier: Naïve Bayes, Logistic Regression, Multilayer Perceptron
  • Data
  • Query strategy
• Export of evaluation results as an xlsx file
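The slides do not show the API itself; the following request is purely hypothetical (endpoint, field names, and values are invented for illustration) and only indicates what such a configuration call might look like:

```python
# Hypothetical LexML configuration request; the real endpoint and JSON
# schema are not shown in the slides, so every name here is invented.
import requests

config = {
    "classifier": "LogisticRegression",   # or "NaiveBayes", "MultilayerPerceptron"
    "queryStrategy": "marginSampling",    # one of the active learning strategies
    "labels": ["Recht", "Pflicht", "Einwendung"],
}
response = requests.post("http://localhost:8080/lexml/configuration", json=config)
print(response.status_code)
```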
Evaluation – Experimental Setup
• Two experiments:
  • Document classification
  • Norm classification
• Active learning is simulated using the available labeled dataset
• Evaluation of twelve different machine learning combinations:
  • Nine active learning strategies
  • Three passive learning strategies
• Evaluation process (sketched below):
  • After each learning round, the resulting model is applied to the test data
  • Export of common evaluation measures (F1, accuracy, precision & recall)
  • The average result of five rounds per combination is used as the reference value
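A minimal sketch of this averaging protocol, assuming a hypothetical helper `run_simulation()` that wraps the loop shown earlier and returns one accuracy curve per round:

```python
# Run each learner/strategy combination five times and average the
# accuracy point-wise per fraction of labeled instances.
# run_simulation() is assumed to return {labeled_fraction: accuracy}.
import numpy as np

def evaluate_combination(run_simulation, rounds=5):
    per_round = [run_simulation() for _ in range(rounds)]
    fractions = sorted(per_round[0].keys())
    return {f: float(np.mean([r[f] for r in per_round])) for f in fractions}
```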
Evaluation – Document Classification I
• Classifying 1000 random documents from the DATEV corpus into one of three classes, using a subset of the words of each document
• Random, but even split into training and test dataset (sketched below)

| Class | Sub-classes | Support |
| --- | --- | --- |
| Law ("Gesetz") | Gesetzestext (Gesamttext), Richtlinie (Gesamttext), EU-Richtlinie (Gesamttext) | 12.8% |
| Judgment ("Urteil") | Urteil, Beschluss, Gerichtsbescheid | 65.5% |
| Generic legal document ("Sonstige Dokumente") | Aufsatz, Anmerkung, Verfügung, Verfügung (koordinierter Ländererlass), Kurzbeitrag, Erlass, Erlass (koordinierter Ländererlass), Schreiben, Schreiben (koordinierter Ländererlass), Übersicht, Mitteilung | 21.6% |
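A minimal sketch of such a split, assuming "even" means equally sized training and test halves (function and parameter names are our own):

```python
# Randomly split documents into equally sized training and test halves.
import random

def even_split(documents, seed=42):
    docs = documents[:]
    random.Random(seed).shuffle(docs)
    half = len(docs) // 2
    return docs[:half], docs[half:]  # (training set, test set)
```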
Evaluation – Document Classification II
Figure: Average accuracy (%) of classifiers using active learning vs. passive (random) learning, plotted over the percentage of labeled instances (3-99%); compared are Naive Bayes, Logistic Regression, and Perceptron, each with and without random sampling.
• Naïve Bayes is the predominant classifier, followed by the Multilayer Perceptron
• Active learning is clearly superior to passive learning
Evaluation – Norm Classification I
• Classifying 504 sentences (norms) from the law of tenancy section of the German Civil Code (BGB) into one of eight possible classes
• Random, but even split into training and test dataset

| Class | Sub-classes | Support |
| --- | --- | --- |
| Recht | Gebot (positiv/negativ/Soll-Pflicht (H)), Verbot (/Duldung (U)) | 25.0% |
| Pflicht | Erlaubnis (/Erlaubnis beschränkt), Ermächtigung | 21.6% |
| Einwendung (Einw rh) | Unwirksamkeit (/Wirksamkeit beschränkt (sachlich)/(persönlich)), Unzulässigkeit (/Zulässigkeit beschränkt), Ausschluss (/Ausschluss beschränkt), Nichtigkeit | 18.3% |
| Rechtsfolge (RF vAw) | Rechtzuweisung (/Rechtsübergang), Pflichtzuweisung (/Pflichterweiterung), Rechtseigenschaft, Freistellung | 9.9% |
| Verfahren | Form, Frist (/Fälligkeit), Maßstab (inkl. Berücksichtigung), Entscheidungskompetenz | 9.7% |
| Verweisung | Direkt § (/direkt Vorschriftenkomplex), Analog § (/analog Vorschriftenkomplex), Negativ | 9.1% |
| Fortführungsnorm | Ausnahme, Erweiterung, Einschränkung, Gleichstellung | 3.8% |
| Definition | Direkt, Indirekt, Negativ | 2.6% |
Evaluation – Norm Classification II
Figure: Average accuracy (%) of classifiers using active learning vs. passive (random) learning, plotted over the percentage of labeled instances (5-100%); compared are Naive Bayes, Logistic Regression, and Perceptron, each with and without random sampling.
• Active learning is clearly superior to passive learning
• Logistic Regression is the best classifier
Summary
Conclusions
• Active learning is a promising approach for legal text classification
• It achieves better results using fewer labeled instances
• The choice of parameters is a very complex issue
Limitations
• Evaluation with "only" one default setting
• Reusability of the trained model
• Black-box classifier
Future Work
• Conduct additional evaluation rounds with varying learning settings
• Implement additional features:
  • Query strategies
  • Learning algorithms
• Combine (active) machine learning with rule-based learning
Johannes Muhr
Advisor: Bernhard Waltl
Technische Universität München
Faculty of Informatics
Chair of Software Engineering for Business Information Systems
Boltzmannstraße 3
85748 Garching bei München
Tel +49.89.289.17132
Fax +49.89.289.17136
matthes@in.tum.de
www.matthes.in.tum.de
Bibliography I
§ Busse, D. (2000). Textsorten des Bereichs Rechtswesen und Justiz. In G. Antos, K. Brinker, W. Heineman, & S. F. Sager (Eds.), Text- und Gesprächslinguistik. Ein internationales Handbuch zeitgenössischer Forschung (Handbücher zur Sprach- und Kommunikationswissenschaft) (pp. 658-675). Berlin/New York: de Gruyter.
§ Cardellino, C., Villata, S., Alemany, L. A., & Cabrio, E. (2015). Information Extraction with Active Learning: A Case Study in Legal Text. Paper presented at the International Conference on Intelligent Text Processing and Computational Linguistics.
§ de Maat, E., Krabben, K., & Winkels, R. (2010). Machine Learning versus Knowledge Based Classification of Legal Texts. In JURIX.
§ Francesconi, E., & Passerini, A. (2007). Automatic classification of provisions in legislative texts. Artificial Intelligence and Law, 15(1), 1-17.
§ Gruner, R. H. (2008). Anatomy of a Lawsuit - A Client's Analysis and Discussion of a Multi-Million Dollar Federal Lawsuit. Retrieved from http://www.gruner.com/writings/AnatomyLawsuit.pdf
§ Landset, S., Khoshgoftaar, T. M., Richter, A. N., & Hasanin, T. (2015). A survey of open source tools for machine learning with big data in the Hadoop ecosystem. Journal of Big Data, 2(1), 24. doi:10.1186/s40537-015-0032-1
§ Liu, B. (2015). Machine Learning and the Future of Law. L.T. Blog.
§ Novak, B., Mladenič, D., & Grobelnik, M. (2006). Text Classification with Active Learning. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nürnberger, & W. Gaul (Eds.), From Data and Information Analysis to Knowledge Engineering: Proceedings of the 29th Annual Conference of the Gesellschaft für Klassifikation e.V., University of Magdeburg, March 9-11, 2005 (pp. 398-405). Berlin, Heidelberg: Springer.
§ Paul, G. L., & Baron, J. R. (2006). Information inflation: Can the legal system adapt? Richmond Journal of Law & Technology, 13, 1.
§ Ratner, A. Leveraging Document Structure for Better Classification of Complex Legal Documents.
Bibliography II
§ Roitblat, H. L., Kershaw, A., & Oot, P. (2010). Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1), 70-80.
§ Šavelka, J., Trivedi, G., & Ashley, K. D. (2015). Applying an Interactive Machine Learning Approach to Statutory Analysis.
§ Segal, R., Markowitz, T., & Arnold, W. (2006). Fast Uncertainty Sampling for Labeling Large E-mail Corpora. Paper presented at the CEAS.
§ Settles, B. (2010). Active learning literature survey. University of Wisconsin, Madison, 52(55-66), 11.
§ Sunkle, S., Kholkar, D., & Kulkarni, V. (2016). Informed Active Learning to Aid Domain Experts in Modeling Compliance. Paper presented at the 2016 IEEE 20th International Enterprise Distributed Object Computing Conference (EDOC), 5-9 Sept. 2016.
§ Tong, S. (2001). Active learning: theory and applications. Citeseer.
§ Tong, S., & Koller, D. (2002). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2(1), 45-66.
§ Walker, V. R., Han, J. H., Ni, X., & Yoseda, K. (2017). Semantic Types for Computational Legal Reasoning: Propositional Connectives and Sentence Roles in the Veterans' Claims Dataset. Paper presented at the ICAIL.
§ Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design science in information systems research. MIS Quarterly, 28(1), 75-105.
Backup – Related Work
• Ratner performed an experiment in which he classified legal contract documents into categories such as "Arbitration Agreements" or "Manufacturing Contract"
• Šavelka, Trivedi and Ashley (2015) examined the relevance and applicability of individual statutory provisions
• Roitblat, Kershaw and Oot (2010) compared the classification accuracy of computers to traditional human manual review
• De Maat, Krabben and Winkels (2010) conducted a norm classification experiment, classifying sentences of Dutch law
• Walker, Han, Ni and Yoseda (2017) examined semantic types for argument mining in legal texts to provide a common basis for machine learning
Backup – Document Classification I
Figure: Average F1 (%) of Naive Bayes per class (Law, Judgment, Generic), plotted over the percentage of labeled instances (3-99%).
• The diversity of the Generic Legal Documents class is too large
• The uneven distribution of instances is recognizable at the beginning
Backup – Document Classification II
Figure: Comparison of query strategies for Naive Bayes; accuracy (%) over the percentage of labeled instances (3-99%) for vote entropy, margin sampling, QBC vote entropy, QBC soft vote entropy, and random sampling.
Backup – Document Classification III
Figure: Average precision (%) of Naive Bayes per class (Law, Judgment, Generic), plotted over the percentage of labeled instances (3-99%).

Backup – Document Classification IV
Figure: Average recall (%) of Naive Bayes per class (Law, Judgment, Generic), plotted over the percentage of labeled instances (3-99%).
Backup – Norm Classification I
Figure: Average F1 (%) of Logistic Regression per class (Recht, Pflicht, Einw rh, RF vAw, Verfahren, Verweisung, Fortführungsnorm, Definition), plotted over the percentage of labeled instances (5-100%).
• The recognition rate is highly diversified across classes
• Overfitting is visible
Backup – Norm Classification II
Figure: Comparison of query strategies for Logistic Regression; accuracy (%) over the percentage of labeled instances (5-100%) for vote entropy, margin sampling, QBC vote entropy, QBC soft vote entropy, and random sampling.

Backup – Norm Classification III
Figure: Average precision (%) of Logistic Regression per class, plotted over the percentage of labeled instances (5-100%).

Backup – Norm Classification IV
Figure: Average recall (%) of Logistic Regression per class, plotted over the percentage of labeled instances (5-100%).

Backup – Norm Classification V
Figure: Accuracy (%) of Logistic Regression applying the vote entropy strategy; rounds 1-5 and their average, plotted over the percentage of labeled instances (5-100%).
Backup – Evaluation Measures
• Confusion Matrix

| | True class A | True class B |
| --- | --- | --- |
| Predicted A | TP | FP |
| Predicted B | FN | TN |

• $\text{Precision} = \frac{TP}{TP + FP}$
• $\text{Recall} = \frac{TP}{TP + FN}$
• $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
• $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
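A minimal sketch computing these measures directly from raw confusion-matrix counts (function names are our own):

```python
# Evaluation measures from confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Example: 80 true positives, 10 false positives, 20 false negatives
print(f1(80, 10, 20))  # ~0.842
```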
Backup – Data Model

Backup – Active Learning Engine

Backup – Workflow Norm Classification
Backup – Query Strategies I
Uncertainty Sampling
§ Margin Sampling
§ $x^*_M = \underset{x}{\operatorname{argmin}} \left( P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x) \right)$, where $\hat{y}_1$ and $\hat{y}_2$ are the first and second most probable labels under the model $\theta$
§ Ambiguous instances with small margins should help the model to discriminate between them.
§ Vote Entropy
§ $x^*_H = \underset{x}{\operatorname{argmax}} \left( -\sum_i P_\theta(y_i \mid x) \log P_\theta(y_i \mid x) \right)$
§ Approach based on the Shannon entropy, measuring the variable's average information content.
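A minimal sketch of both scores, assuming a matrix of class probabilities per unlabeled instance (function names are our own):

```python
# Margin and entropy query scores over a (n_instances, n_classes) matrix.
import numpy as np

def margin_query(probs):
    # Smallest gap between the two most probable labels = most informative.
    part = np.sort(probs, axis=1)
    margins = part[:, -1] - part[:, -2]
    return int(np.argmin(margins))

def entropy_query(probs):
    # Highest Shannon entropy of the predicted distribution = most informative.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return int(np.argmax(entropy))
```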
Backup – Query Strategies II
Query By Committee
§ Vote Entropy
§ $x^*_{VE} = \underset{x}{\operatorname{argmax}} \left( -\sum_y \frac{V(y, x)}{|C|} \log \frac{V(y, x)}{|C|} \right)$
§ y ranges over all possible labelings, V(y, x) is the number of committee members that "voted" for label y for instance x, and |C| is the committee size.
§ Soft Vote Entropy
§ $x^*_{SVE} = \underset{x}{\operatorname{argmax}} \left( -\sum_y P_C(y \mid x) \log P_C(y \mid x) \right)$
§ Taking the confidence of the decision into account: $P_C(y \mid x) = \frac{1}{|C|} \sum_{\theta \in C} P_\theta(y \mid x)$ is the average ("consensus") probability that y is the correct label according to the committee.
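A minimal sketch, assuming each committee member supplies a probability matrix of shape (n_instances, n_classes); function names are our own:

```python
# Query-by-committee scores from a list of per-member probability matrices.
import numpy as np

def qbc_vote_entropy(member_probs):
    votes = np.stack([p.argmax(axis=1) for p in member_probs])  # (|C|, n)
    n_classes = member_probs[0].shape[1]
    c = len(member_probs)
    # Fraction of committee members voting for each label, per instance.
    counts = np.stack([(votes == y).sum(axis=0) for y in range(n_classes)], axis=1) / c
    entropy = -np.sum(counts * np.log(counts + 1e-12), axis=1)
    return int(np.argmax(entropy))

def qbc_soft_vote_entropy(member_probs):
    consensus = np.mean(member_probs, axis=0)  # average ("consensus") distribution
    entropy = -np.sum(consensus * np.log(consensus + 1e-12), axis=1)
    return int(np.argmax(entropy))
```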
Backup – Classifier I
Naïve Bayes
• Feature d, class c
• P(d) is constant for a given document (the same for every class), so it can be ignored when comparing classes
• Assumes strong independence between the individual features
• $P(c \mid d) = \frac{P(d \mid c) \, P(c)}{P(d)}$
Logistic Regression
• Uses the characteristic that the log-odds of observing the true label (logit transformation) can be modelled as a linear combination of the attributes
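Spelled out in our own notation (not taken from the slides), that statement for a binary label $y$ with feature vector $x$ and weights $w$ reads:

$$
\operatorname{logit}\big(P(y{=}1 \mid x)\big)
  = \ln \frac{P(y{=}1 \mid x)}{1 - P(y{=}1 \mid x)}
  = w_0 + w_1 x_1 + \dots + w_n x_n
\;\Longleftrightarrow\;
P(y{=}1 \mid x) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_n x_n)}}
$$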
Backup – Classifier II
Multilayer Perceptron
• Consists of several layers with units (nodes) in each layer; in the depicted network, an input layer (x1, x2, x3) feeds a hidden layer (h1, h2, h3), which feeds the output layer (y)
• A weight is assigned to each edge
• Backpropagation is used as the learning algorithm:
  • Forward phase: the weights are fixed and training instances are propagated through the network
  • Backward phase: the desired output is compared to the actual output at each output neuron to calculate the error
  • For each neuron in the network, all weights are adjusted so that the actual output better approximates the desired output
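In our notation (not from the slides), the per-edge adjustment in the backward phase is the standard gradient descent step, with learning rate $\eta$ and error $E$:

$$
w_{ij} \;\leftarrow\; w_{ij} - \eta \, \frac{\partial E}{\partial w_{ij}}
$$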
Backup – Supervised Machine Learning

Backup – Text Classification Process
Backup – Front-end (Configuration)
Import of norms from selected legal documents
Configuration of active learning settings

Backup – Front-end (Training)
Labeling of legal documents
Backup – Comparison of Machine Learning Frameworks I

| | MLlib | Mahout | Weka | Scikit-learn | Mallet |
| --- | --- | --- | --- | --- | --- |
| Current version (2nd Feb. 2016) | 2.1 | 0.12.2 | 3.8 | 0.18 | 2.0.8 |
| License | Apache Software Foundation (ASF) | Apache Software Foundation (ASF) | General Public License (GPL) | Berkeley Software Distribution (BSD) | Common Public License (CPL) |
| Open source | yes | yes | yes | yes | yes |
| Popular users | OpenTable, Verizon | ? | Pentaho | Evernote, Spotify | ? |
| Processing platform | Spark | MapReduce (deprecated), Spark | wrapper for Spark available | none | none |
| Interface language | mainly Scala; Java, Python, R | Scala, Java (for older versions) | Java, R | Python | Java |
| Suitable for large datasets | yes | yes | partially | yes | partially |
| Community support | good | moderate | good | good | moderate |
| Documentation | very good | moderate | good | very good | good |
| Support for NLP/textual data/AL | moderate | good (via Lucene) | good | good | very good |
| Configuration | simple | difficult | simple | simple | very simple |
Backup – Comparison of Machine Learning Frameworks II

| | MLlib | Mahout | Weka | Scikit-learn | Mallet |
| --- | --- | --- | --- | --- | --- |
| Classification algorithms | | | | | |
| NB | ✓ (a) | ✓ (Spark optimized) | ✓ (batch, incremental) | ✓ | ✓ |
| SVM | ✓ (b) | only via MLlib | ✓ (batch) | ✓ | / |
| MLP | ✓ (a) | only via MLlib | ✓ (batch) | ✓ | / |
| Multiclass classification | one vs. all | only via MLlib | one vs. all, one vs. one | one vs. all, one vs. one | / |
| Multilabel classification | one vs. all | / | by extensions (e.g. Mulan, Meka) | one vs. all | / |
| Pipeline | ✓ (a) | / | / | ✓ | ✓ |
| Total algorithm coverage | very high | high | very high | very high | moderate |
| Preprocessing | | | | | |
| Stemming | / | ✓ | ✓ (by extensions, Snowball) | ✓ | ✓ |
| Stop words | ✓ | / | ✓ | ✓ | ✓ |
| Coverage of feature selection methods | very high | low | very high | very high | moderate |
| Text representation | | | | | |
| Word vector | ✓ | ✓ | ✓ | ✓ | ✓ |
| TF-IDF | ✓ | – | ✓ | ✓ | – |
| Evaluation | | | | | |
| Confusion matrix | ✓ | ✓ | ✓ | ✓ | ✓ |
| ROC/AUC curve | ✓ | / | ✓ | ✓ | ✓ |
Backup – Literature Study
• Use of online platforms such as
  • Google Scholar,
  • Web of Science,
  • Institute of Electrical and Electronics Engineers (IEEE),
  • Online Public Access Catalogue (OPAC), and Google Books
• Backward search
Backup – Hourly Billings by Task and Individual
• Manual document classification is very expensive and time-consuming
• $13.5 million was spent on classifying 1.6 million items over four months (≈ $8.50 per document) [1]
• A lot of time is wasted on (document) discovery [2]
• Hours: 1,446 / 5,486 = 26.4%
• Dollars: $483,986 / $1,697,322 = 28.5%
[1] Roitblat, H. L., et al. (2010). Document categorization in legal electronic discovery: computer classification vs. manual review.
[2] Gruner (2008). A Client's Analysis and Discussion of a Multi-Million Dollar Federal Lawsuit.