Explanation of the Legal Profession’s Remarkably Slow Adoption of Predictive Coding

Well-known predictive coding expert and attorney Maura Grossman and her husband, noted information scientist Gordon Cormack, recently began an article in Practical Law magazine with this assertion:

Adoption of TAR has been remarkably slow, considering the amount of attention these offerings have received since the publication of the first federal opinion approving TAR use (see Da Silva Moore v. Publicis Groupe, 287 F.R.D. 182 (S.D.N.Y. 2012)).

Grossman & Cormack, Continuous Active Learning for TAR (Practical Law, April/May 2016).

TAR, which stands for Technology Assisted Review, is their favorite term for what the legal profession commonly calls predictive coding. I remember when our firm attained the landmark ruling in our Da Silva Moore case. I thought it would open a floodgate of new cases. It did not. But it did start a flood of judicial rulings approving predictive coding all around the country, and lately, around the world. See, e.g., Pyrrho Investments Ltd v MWB Property Ltd [2016] EWHC 256 (Ch) (2/26/16). Judge Andrew Peck’s more recent ruling on the topic contains a good summary of the law. Rio Tinto PLC v. Vale S.A., 306 F.R.D. 125 (S.D.N.Y. 2015). The bottom line is that at this point in time, late May 2016, the Bench is waiting for the Bar to catch up.

Although I am known for my exuberant endorsement of predictive coding, this enthusiasm for new technology to find electronic evidence is still rare in the legal profession. Losey, R., Why I Love Predictive Coding: Making document review fun with Mr. EDR and Predictive Coding 3.0. (2/14/16). So why do I love this technology so much, and most other lawyers, not so much? It may have to do with the fact that I have been using computers since 1978 and am very used to pushing the technology edge. But that just explains why I was one of the first to knock on the door of predictive coding, not why I like the room. I have been an early adopter of many technologies that proved disappointing. (Anybody want to buy a slightly used iWatch?) No, I like it because it really works.

This in turn raises the question of why all attorneys have not had this same reaction. If it really works for me, it should really work for everyone, right? And so everyone should be loving predictive coding, right? No. It is not working for everyone. Many have had unpleasant experiences with predictive coding. They left the room bored and frustrated. I am reminded of the old commercial, Where’s the beef? They went back to their old familiar keyword searches. Pity.

It took me a while to figure this out, that others were having failures and not talking about it. (Who can blame them?) In retrospect I should have seen this earlier. Still, it took Grossman and Cormack until 2015 to figure this out too. Interestingly, we have come to the same conclusion on causation. Bad software is not the main reason, although varying software quality among vendors is part of the explanation. Some software on the market is not that good, or does not even have bona fide predictive coding features using active machine learning. But these software differences only explain some of the dissatisfaction. The real reason for the failures is that attorneys have not been using the predictive coding features properly. They have been doing it wrong. That is why it did not work well for them. That is why many attorneys tried it out and never returned.

Grossman and Cormack explain this and provide their best-practice methods in the new article, Continuous Active Learning for TAR, and many other articles they have written since 2015. I read and recommend them all. I have shared my own best practices in my lengthy personal blog, Predictive Coding 3.0 article, part one and part two. Part one describes the history and part two describes the method. Our best practices are not exactly the same, but they are close and compatible. I have written a total of 59 articles on the subject now that are currently all online and freely available. I call the method Hybrid Multimodal and its basic steps are shown in the figure below.


Feel free to drop me an email if you are an in-house counsel and want to know more about predictive coding best practices. Training on this topic is one of the services that we offer our clients. So too is the search for responsive evidence in litigated matters, or internal investigations, using our proven successful Hybrid Multimodal method of Predictive Coding document review.



The Exploitation of America’s Cybersecurity Vulnerabilities by China and Other Foreign Governments

The Chinese People’s Liberation Army attacks American companies every day to try to steal trade secrets and gain commercial advantage for state controlled businesses.


Gu Chunhui

Criminal hackers can cause tremendous damage, whether trained in China or not. If a high-level expert, such as any member of China’s elite Unit 61398, aka the Comment Crew, gets into your system, they can seize root control and own it. They can then plant virtually undetectable back doors into your systems. This allows them to come and go later as they please.

A member of the Comment Crew could be in your computer system right now and you would not know it. For instance, Gu Chunhui, who often goes under the online alias, Kandy Goo, and is a high ranking military officer of Unit 61398, could be looking at your computer screen now. Captain Goo could be running programs in the background without your knowledge. Or he could be reading your email. He would be looking for some information of value to his country, or of value to any of the thousands of businesses controlled by the Chinese government. Captain Goo may have a cute Internet name, and look more like a movie star in a martial arts film than an army man, but do not be fooled. Do not underestimate his considerable computer skills and strong patriotic intent. Yes, breaking into your computer systems and stealing data is a matter of patriotic duty for him and other hackers trained by the government of communist China.

Unit 61398 of the Third Department of the Chinese People’s Liberation Army is reported to be the best of the best in China. Gu Chunhui is a determined military officer. Although DOJ documents show that Gu, like everybody else in Shanghai where he is stationed, takes a two-hour break every day for lunch, he still works hard the rest of the day to break into your computer system and steal your data (and your clients’). He and others in Unit 61398 are armed and dangerous. They have both viruses and guns. They should not be taken lightly. All of the Unit 61398 Comment Crew, including Captain Goo, are very good at what they do. I am worried. You should be too.

Do not get me wrong, the Chinese government does not have a monopoly on black hat hacking. The whole idea was born in the United States. It could just as easily be a criminal hacker from Russia, Ukraine, Poland, Iran, or Syria who has taken control of your system. It could be a teenager down the street. They could be from anywhere, although if they are after trade secrets, not money, it is probably one of the thousands of hackers who work for the Chinese government. It could even be one of the five officers in Unit 61398 in Shanghai who have been indicted by the DOJ.


DOJ’s 31 Count Criminal Indictment Against Five Military Officers
of Unit 61398 of the Third Department of the Chinese People’s Liberation Army

Five military officers of Unit 61398, including Gu Chunhui, were indicted in 2014 by the Department of Justice for theft of commercial trade secrets from several large U.S. corporations and a union. No, they have not been arrested, nor is it likely they ever will be. This was more of a symbolic gesture than anything else, a wake-up call for American business. Still, at least one person in the U.S., a Chinese businessman, has been arrested and convicted of helping the Chinese government steal trade secrets. Businessman admits helping Chinese military hackers target U.S. contractors (Washington Post, 3/23/16).

The DOJ has also recently unsealed charges made against the Syrian Electronic Army — a hacking group that supports embattled Syrian President Bashar al-Assad. In addition, on March 24, 2016, the Manhattan U.S. Attorney announced charges against seven Iranians for conducting a coordinated campaign of cyber attacks against the U.S. financial sector on behalf of the Islamic Revolutionary Guard. A copy of the indictment of the Iranians is published here by the DOJ. It is a very dangerous world right now and very challenging to protect trade secrets.

The indictment against the Chinese Military officers is especially notable to the legal profession in that some of the secrets allegedly stolen include attorney-client communications. See the 31 count indictment against five Chinese military officers for details. The chart below provides a high level overview. Every count is against all five officers.

Count(s) | Charge | Statute | Maximum Penalty
1 | Conspiring to commit computer fraud and abuse | 18 U.S.C. § 1030(b) | 10 years
2-9 | Accessing (or attempting to access) a protected computer without authorization to obtain information for the purpose of commercial advantage and private financial gain | 18 U.S.C. §§ 1030(a)(2)(C), 1030(c)(2)(B)(i)-(iii), and 2 | 5 years (each count)
10-23 | Transmitting a program, information, code, or command with the intent to cause damage to protected computers | 18 U.S.C. §§ 1030(a)(5)(A), 1030(c)(4)(B), and 2 | 10 years (each count)
24-29 | Aggravated identity theft | 18 U.S.C. §§ 1028A(a)(1), (b), (c)(4), and 2 | 2 years (mandatory consecutive)
30 | Economic espionage | 18 U.S.C. §§ 1831(a)(2), (a)(4), and 2 | 15 years
31 | Trade secret theft | 18 U.S.C. §§ 1832(a)(2), (a)(4), and 2 | 10 years

The possibility, indeed probability, of hacker attacks on law firms is one reason we outsource the holding of all large stores of our clients’ electronic data in e-discovery. We put the ESI in the hands of a global vendor with one of the most secure facilities in the world. Feel free to ask me about it. Protection of client data is an important ethical duty of every attorney. We take it very seriously and conduct all of our work accordingly.

Conclusion to 14 Part Series on Document Culling

This is the fourteenth and final blog in a series on two-filter document culling. (Yes, we went for and obtained a world record on the longest law blog series!) Document culling is very important to successful, economical document review. Please read parts one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve and thirteen before this one.


There is much more to efficient, effective review than just using software with predictive coding features. The methodology of how you do the review is critical. The two filter method described here has been used for years to cull away irrelevant documents before manual review, but it has typically just been used with keywords. I have shown in this lengthy series of blogs how this method can be employed in a multimodal manner that includes predictive coding in the Second Filter.

Keywords can be an effective method to both cull out presumptively irrelevant files and cull in presumptively relevant ones, but keywords are only one method among many. In most projects it is not even the most effective method. AI-enhanced review with predictive coding is usually a much more powerful method to cull out the irrelevant and cull in the relevant and highly relevant.

If you are using a one-filter method, where you just do a rough cut and filter out by keywords, date, and custodians, and then manually review the rest, you are reviewing too much. It is especially ineffective when you collect based on keywords. As shown in Biomet, that can doom you to low recall, no matter how good your later predictive coding may be.

If you are using a two-filter method, but are not using predictive coding in the second filter, you are still reviewing too much. The two-filter method is far more effective when you use relevance probability ranking to cull out documents from final manual review.
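For readers who think in code, the two-filter flow described above can be sketched in a few lines of Python. The `first_filter` and `rank` functions are hypothetical stand-ins for whatever coarse culling criteria and predictive ranking engine a given project actually uses; this is an illustration of the shape of the method, not any vendor's implementation:

```python
def two_filter_cull(docs, first_filter, rank, cutoff=0.5):
    """Coarse First Filter, then predictive-ranking Second Filter."""
    # First Filter: rough cut by keywords, dates, custodians, file types
    pool = [d for d in docs if first_filter(d)]
    # Second Filter: keep only documents ranked probable relevant
    return [d for d in pool if rank(d) >= cutoff]

# Toy example: filter on a keyword, then on a made-up relevance score.
docs = [
    {"text": "contract pricing memo", "score": 0.91},
    {"text": "contract lunch invite", "score": 0.08},
    {"text": "weekend plans",         "score": 0.70},
]
review_pool = two_filter_cull(
    docs,
    first_filter=lambda d: "contract" in d["text"],
    rank=lambda d: d["score"],
)
print(len(review_pool))  # only the pricing memo survives both filters
```

Real projects would of course apply richer criteria in both filters, but the two-stage reduction is the point: each filter shrinks the pool that humans must actually read.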


Employers Have An Obligation To Provide Meaningful Direction To Employees In Email Searches, But Employers Can’t Be Compelled To Recover Company Emails Stored On Personal Accounts Of Employees

This blog post is written by Douglas Johnston in our San Francisco office.

A recent case from the Northern District of California underscores the importance of actively engaging with employees to coordinate the search for documents and electronically stored information to comply with the employer’s discovery obligations. At the same time, the Court ruled that an employer cannot be compelled to produce business-related emails from the personal email accounts of its employees.

In Matthew Enterprise, Inc. v. Chrysler Group, LLC, the plaintiff, Stevens Creek – a car dealership – sued Chrysler for price discrimination in violation of the Robinson-Patman Act.  During discovery, Chrysler sought emails from Stevens Creek’s employees’ corporate Gmail accounts as well as emails from the employees’ personal email accounts which, at times, were used for business purposes.

As to the emails from employees’ corporate accounts, Chrysler argued that Stevens Creek used inadequate search parameters, failed to provide employees with a copy of the discovery requests, did not provide any meaningful direction to the employees on how to identify requested ESI and did not ask all relevant custodians to search for documents. In opposition, Stevens Creek argued it had undertaken reasonable efforts in good faith to comply with the requests for production.

With regard to emails from employees’ personal accounts, Stevens Creek argued that the emails were outside its “possession, custody, or control,” and, therefore, beyond the scope of discovery from Stevens Creek. Chrysler responded that Stevens Creek had control over company information regardless of whether it is stored on personal email accounts, and pointed to plaintiff’s employee handbooks instructing employees to keep “internal information” in the “sole possession” of Stevens Creek.

Magistrate Judge Paul S. Grewal, applying the recent amendments to the Federal Rules of Civil Procedure, found Stevens Creek’s ESI search efforts to be lacking, citing as specific examples the suggestion by Stevens Creek to its employees to merely pull any email with the word “Chrysler” in it, and Stevens Creek’s limitation of the relevant custodians to sales employees. Accordingly, Judge Grewal ordered Stevens Creek to ask both salespeople and all other employees who may have relevant documents to cooperate with the search, and to coordinate the search for documents by telling those employees exactly what Chrysler had asked for and suggesting broad sets of search terms.

However, Judge Grewal found that Chrysler had failed to show that any contract existed between Stevens Creek and its employees requiring its employees to provide information stored in their personal accounts despite language in Stevens Creek’s handbook instructing employees to keep “internal information” in the “sole possession” of Stevens Creek. The court noted that the handbook language did not create a legal right and there was no “authority under which Stevens Creek could force employees to turn them over.”

Judge Grewal’s ruling has two important implications for employers. First, when responding to requests for electronically stored information, employers must take an active role in assisting employee-custodians in their search for responsive documents. Second, Judge Grewal’s ruling indicates that employers should have strong agreements in place with employees who may be storing company information in personal email accounts, such as Gmail, for otherwise they may be prevented from recovering that information when needed. Instead, these employees may be subject to direct, third-party discovery of relevant information in their custody and control under Rule 45. This can complicate the employer’s defense and increase the overall cost of electronic discovery.

Case Example of Quick Peek Type of Production Without Full Manual Review

This is part Thirteen of the continuing series on two-filter document culling. (Yes, we are going for a world record on the longest law blog series. :) Document culling is very important to successful, economical document review. Please read parts one, two, three, four, five, six, seven, eight, nine, ten, eleven and twelve before this one.

Limiting Final Manual Review

In some cases you can, with client permission (often insistence), dispense with attorney review of all or nearly all of the documents in the upper half. You might, for instance, stop after the manual review has attained a well-defined and stable ranking structure. For example, you have only reviewed 10% of the probable relevant documents (top half of the diagram), but decide to produce the other 90% of the probable relevant documents without attorney eyes ever looking at them. There are, of course, obvious problems with privilege and confidentiality in such a strategy. Still, in some cases, where appropriate clawback and other confidentiality orders are in place, the client may want to risk disclosure of secrets to save the costs of final manual review. This should, however, only be done with full disclosure and understanding of the considerable risks involved. We do not recommend this bypass, but on some rare occasions it makes sense.

In such productions there are also dangers of imprecision, where a significant percentage of irrelevant documents are included. This in turn raises concerns that an adversarial view of the other documents could engender other suits, even if there is some agreement for the return of irrelevant documents. Once the bell has been rung, privileged or hot, it cannot be un-rung.

Case Example of Production With No Final Manual Review

In spite of the dangers of the unringable bell, the allure of extreme cost savings can be strong for some clients in some cases. For instance, I did one experiment using multimodal CAL with no final review at all, where I still attained fairly high recall, and the cost per document was only seven cents. I did all of the review myself, acting as the sole SME. The visualization of this project would look like the figure below.


Note that if the SME review pool were drawn to scale according to number of documents read, then, in most cases, it would be much smaller than shown. In the review where I brought the cost down to $0.07 per document I started with a document pool of about 1.7 Million, and ended with a production of about 400,000. The SME review pool in the middle was only 3,400 documents.

As far as legal search projects go it was an unusually high prevalence, and thus the production of 400,000 documents was very large. Four hundred thousand was the number of documents ranked with a 50% or higher probable relevance when I stopped the training. I only personally reviewed about 3,400 documents during the SME review. I then went on to review another 1,745 documents after I decided to stop training, but did so only for quality assurance purposes and using a random sample. To be clear, I worked alone, and no one other than me reviewed any documents. This was an Army of One type project.

Although I only personally reviewed 3,400 documents for training, I actually instructed the machine to train on many more documents than that. I just selected them for training without actually reviewing them first. I did so on the basis of ranking and judgmental sampling of the ranked categories. It was somewhat risky, but it did speed up the process considerably, and in the end it worked out very well. I later found out that other information scientists often use this technique as well. See, e.g., Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR ’14, July 6–11, 2014, at pg. 9.

My goal in this project was recall, not precision, nor even F1, and I was careful not to over-train on irrelevance. The requesting party was much more concerned with recall than precision, especially since the relevancy standard here was so loose. (Precision was still important, and was attained too. Indeed, there were no complaints about that.) In situations like that the slight over-inclusion of relevant training documents is not terribly risky, especially if you check out your decisions with careful judgmental sampling, and quasi-random sampling.

I accomplished this review in two weeks, spending 65 hours on the project. Interestingly, my time broke down into 46 hours of actual document review time, plus another 19 hours of analysis. Yes, about one hour of thinking and measuring for every two and a half hours of review. If you want the secret of my success, that is it.

I stopped after 65 hours, and two weeks of calendar time, primarily because I ran out of time. I had a deadline to meet and I met it. I am not sure how much longer I would have had to continue the training before the training fully stabilized in the traditional sense. I doubt it would have been more than another two or three rounds; four or five more rounds at most.

Typically I have the luxury to keep training in a large project like this until I no longer find any significant new relevant document types, and do not see any significant changes in document rankings. I did not think at the time that my culling out of irrelevant documents had been ideal, but I was confident it was good, and certainly reasonable. (I had not yet uncovered my ideal upside down champagne glass shape visualization.) I saw a slow down in probability shifts, and thought I was close to the end.

I had completed a total of sixteen rounds of training by that time. I think I could have improved the recall somewhat had I done a few more rounds of training, and spent more time looking at the mid-ranked documents (40%-60% probable relevant). The precision would have improved somewhat too, but I did not have the time. I am also sure I could have improved the identification of privileged documents, as I had only trained for that in the last three rounds. (It would have been a partial waste of time to do that training from the beginning.)

The sampling I did after the decision to stop suggested that I had exceeded my recall goals, but still, the project was much more rushed than I would have liked. I was also comforted by the fact that the elusion sample test at the end passed my accept on zero error quality assurance test. I did not find any hot documents. For those reasons (plus great weariness with the whole project), I decided not to pull some all-nighters to run a few more rounds of training. Instead, I went ahead and completed my report, added graphics and more analysis, and made my production with a few hours to spare.

A scientist hired after the production did some post-hoc testing that confirmed, at an approximate 95% confidence level, a recall achievement of between 83% and 94%. My work also withstood all subsequent challenges. I am not at liberty to disclose further details.
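The kind of recall range reported above can be illustrated with a standard Wilson score confidence interval for a binomial proportion. To be clear, the scientist's actual validation method is not disclosed here, and the sample numbers below are hypothetical:

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 -> ~95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Hypothetical: 90 of 100 sampled relevant documents were retrieved.
lo, hi = wilson_interval(90, 100)
print(f"recall is between {lo:.0%} and {hi:.0%} at ~95% confidence")
# prints: recall is between 83% and 94% at ~95% confidence
```

The width of the interval shrinks as the sample grows, which is why post-hoc validation samples in large reviews need to be substantial before a recall claim means much.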

In post hoc analysis I found that the probability distribution was close to the ideal shape that I now know to look for. The below diagram represents an approximate depiction of the ranking distribution of the 1.7 Million documents at the end of the project. The 400,000 documents produced (obviously I am rounding off all these numbers) were ranked 50% plus, and the 1,300,000 not produced were ranked less than 50%. Of the 1,300,000 Negatives, 480,000 documents were ranked with only 1% or less probable relevance. On the other end, the high side, 245,000 documents had a probable relevance ranking of 99% or more. There were another 155,000 documents with a ranking between 50% and 99% probable relevant. Finally, there were 820,000 documents ranked between 1% and 49% probable relevant.
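The stratification figures above are internally consistent, as a few lines of arithmetic confirm (all numbers rounded as in the text):

```python
# Approximate ranking distribution at the end of the project (rounded).
buckets = {
    "99-100%": 245_000,  # highest probable relevance
    "50-99%":  155_000,
    "1-49%":   820_000,
    "0-1%":    480_000,  # lowest probable relevance
}

produced = buckets["99-100%"] + buckets["50-99%"]  # ranked 50% or higher
withheld = buckets["1-49%"] + buckets["0-1%"]      # ranked below 50%

print(produced)             # documents produced
print(withheld)             # documents not produced
print(produced + withheld)  # total corpus
```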


The file review speed realized here, about 35,000 files per hour, and the extremely low cost, about $0.07 per document, would not have been possible without the client’s agreement to forgo full document review of the 400,000 documents produced. A group of contract lawyers could have been brought in for second-pass review, but that would have greatly increased the cost, even assuming a billing rate for them of only $50 per hour, which was one-tenth of my rate at the time (it is now much higher).

The client here was comfortable with reliance on confidentiality agreements for reasons that I cannot disclose. In most cases litigants are not, and insist on eyes on review of every document produced. I well understand this, and in today’s harsh world of hard ball litigation it is usually prudent to do so, clawback or no.

Another reason the review was so cheap and fast in this project is that there were very few transactional costs with opposing counsel involved, and everyone was hands off. I just did my thing, on my own, and with no interference. I did not have to talk to anybody; I just read a few guidance memorandums. My task was to find the relevant documents, make the production, and prepare a detailed report – 41 pages, including diagrams – that described my review. Someone else prepared a privilege log for the 2,500 documents withheld on the basis of privilege.

I am proud of what I was able to accomplish with the two-filter multimodal methods, especially as it was subject to the mentioned post-review analysis and recall validation. But, as mentioned, I would not want to do it again. Working alone like that was very challenging and demanding. Further, it was only possible at all because I happened to be a subject matter expert on the type of legal dispute involved. There are only a few fields where I am competent to act alone as an SME. Moreover, virtually no legal SMEs are also experienced ESI searchers and software power users. In fact, most legal SMEs are technophobes. I have even had to print out key documents to paper to work with some of them.

Even if I have adequate SME abilities on a legal dispute, I now prefer a small team approach rather than a solo approach. I now prefer to have one or two attorneys assisting me with the document reading, and a couple more assisting me as SMEs. In fact, I can act as the conductor of a predictive coding project where I have very little or no subject matter expertise at all. That is not uncommon. I just work as the software and methodology expert; the Experienced Searcher.

Recently I worked on a project where I did not even speak the language used in most of the documents. I could not read most of them, even if I tried. I just worked on procedure and numbers alone. Others on the team got their hands in the digital mud and reported to me and the SMEs. This works fine if you have good bilingual SMEs and contract reviewers doing most of the hands-on work.


To be continued …. (final installment comes next!)

The 2015 Federal Rules Amendments: the importance of proportionality

This article is by Anne H. Smith, a shareholder in our Raleigh, North Carolina office.

The 2015 amendments to the Federal Rules of Civil Procedure went into effect on December 1, 2015. They apply not only to cases filed on or after this date but also to pending proceedings “insofar as just and practicable.” The amendments focused largely on e-discovery and how to tame discovery abuses in light of the electronic information explosion.

While the federal rules had clearly recognized for decades that discovery should be limited based on “undue burden,” the amendments expand this concept to include “undue expense.” The amendments have ushered in a new period of discovery which Ralph Losey has characterized as the “Goldilocks Era,” where judges and litigants must find that “just right” balance of discovery. The discovery cannot be too broad or too expensive; it must be “just right” considering the needs of the case.


This proportionality theme is demonstrated by Rule 26(b)’s amendments which limit the scope of discovery to any party’s claim or defense considering the needs of the case. Previously courts could allow expanded discovery to the subject matter of the case for good cause shown.

The new amendments have deleted this provision and instead adopted a proportionality analysis by considering the following factors:

  • The importance of the issues at stake in the action;
  • The amount in controversy;
  • The parties’ relative access to relevant information;
  • The parties’ resources;
  • The importance of the discovery in resolving the issues; and
  • Whether the burden or expense of the proposed discovery outweighs its likely benefits.

These criteria are not wholesale changes as they were moved from old Rule 26(b)(2)(C). The only “new” provision is the parties’ relative access to information. However, this factor was borrowed from old 26(b)(2)(B)’s provision on inaccessible data.

In most cases, the main factor which will likely be outcome determinative is whether the expense of the discovery outweighs its benefits. In other words, is the cost of discovery proportional to the value of the case? With the revisions, parties will need to assess very early in the case just how much discovery (including e-discovery) will cost and weigh that against the “real,” non-inflated amount in controversy in the case. If a case realistically has a value of $100,000, it makes little sense for a party to have to spend $50,000 on discovery. In contrast, if a case has a realistic value of $1,000,000, spending $50,000 (or more) on discovery is reasonable and proportional to the value of the case. However, you cannot evaluate proportionality unless you understand what is required to find the information which the other side has requested and then evaluate how much it is going to cost to review and produce that information.
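The cost-benefit weighing in the examples above amounts to a simple ratio test. The sketch below is only an illustration of that arithmetic; the 20% threshold is an arbitrary assumption for demonstration, not a rule of law:

```python
def proportionality_check(discovery_cost, case_value, max_ratio=0.20):
    """Flag discovery whose cost is out of proportion to case value."""
    ratio = discovery_cost / case_value
    return ratio, ratio <= max_ratio

# The article's two hypotheticals: $50,000 of discovery against
# a $100,000 case versus a $1,000,000 case.
print(proportionality_check(50_000, 100_000))    # 50% of case value
print(proportionality_check(50_000, 1_000_000))  # 5% of case value
```

The hard part in practice, as the paragraph above notes, is not the division but estimating both inputs early: the realistic, non-inflated case value and the true cost to find, review and produce the requested information.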

Review of the Basic Idea of Document Culling

This is part Twelve of the continuing series on two-filter document culling. (Yes, we are going for a world record on the longest law blog series. :) Document culling is very important to successful, economical document review. Please read parts one, two, three, four, five, six, seven, eight, nine, ten and eleven before this one.

Review of Basic Idea of Two Filter Search and Review

Whether you use predictive ranking or not, the basic idea behind the two-filter method is to start with a very large pool of documents, reduce the size by a coarse First Filter, then reduce it again by a much finer Second Filter. The result should be a much, much smaller pool that is human reviewed, and an even smaller pool that is actually produced or logged. Of course, some of the documents subject to the final human review may be overturned, that is, found to be irrelevant, False Positives. That means they will not make it to the very bottom production pool after manual review shown in the diagram at right.

In multimodal projects where predictive coding is used the precision rates can often be very high. Lately I have been seeing that the second pool of documents, the one subject to manual review, has precision rates of at least 80%, sometimes even as high as 95% near the end of a CAL project. That means the final pool of documents produced is almost as large as the pool after the Second Filter.

Please remember that almost every document that is manually reviewed and coded after the Second Filter gets recycled back into the machine training process. This is known as Continuous Active Learning or CAL, and in my version of it at least, is multimodal and not limited to only high probability ranking searches. See: Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part Two. In some projects you may just train for multiple iterations and then stop training and transition to pure manual review, but in most you will want to continue training as you do manual review. Thus you set up a CAL constant feedback loop until you are done, or nearly done, with manual review.
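A minimal sketch of that CAL feedback loop looks like the Python below. The toy scoring model and the `reviewer` function are hypothetical stand-ins for real active-learning software and attorney coding, and this simplification ranks by probability only, rather than the full multimodal approach described here:

```python
def train(labeled_docs):
    """Toy 'model': score = share of a doc's terms seen in relevant docs."""
    relevant_terms = set()
    for text, is_relevant in labeled_docs:
        if is_relevant:
            relevant_terms.update(text.split())
    def score(text):
        terms = text.split()
        return sum(t in relevant_terms for t in terms) / max(len(terms), 1)
    return score

def cal_review(corpus, reviewer, seed_ids, batch=2, rounds=2):
    """Continuous Active Learning: review top-ranked docs, retrain, repeat."""
    labels = {i: reviewer(corpus[i]) for i in seed_ids}
    for _ in range(rounds):
        score = train([(corpus[i], lab) for i, lab in labels.items()])
        unreviewed = [i for i in range(len(corpus)) if i not in labels]
        unreviewed.sort(key=lambda i: score(corpus[i]), reverse=True)
        for i in unreviewed[:batch]:         # attorney reviews top-ranked docs
            labels[i] = reviewer(corpus[i])  # coding recycled into training
    return labels

# Toy run: the reviewer stands in for human relevance coding.
docs = ["fraud payment wire", "lunch menu", "fraud invoice",
        "picnic plans", "wire fraud scheme", "weather report"]
labels = cal_review(docs, reviewer=lambda text: "fraud" in text, seed_ids=[0, 1])
print(sum(labels.values()), "relevant documents found")
```

Each round, the human coding of the newest batch flows straight back into the next training pass, which is exactly the constant feedback loop described above.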


As mentioned, active machine learning trains on both relevance and irrelevance, although, in my opinion, the Highly Relevant documents, the hot documents, are the most important of all for training purposes. The idea is to use predictive coding to segregate your data into two separate camps, relevant and irrelevant. You not only separate them, but you also rank them according to probable relevance. The software I normally use, Kroll Ontrack's EDR, has a percentage system from .01% to 99.9% probable relevant, and vice versa. A very good segregation-ranking project should end up looking like an upside-down champagne glass.


A near-perfect segregation-ranking project will end up looking like an upside-down T, with even fewer documents in the unsure middle section. If you turn the graphic so that the lowest-ranked probable relevant documents are on the left, and the highest on the right, a near-perfect project ranking looks like this standard bar graph:


The above is a screen shot from a recent project of mine, taken after training was complete. This project had about a 4% prevalence of relevant documents, so it made sense for the relevant half to be far smaller. But what is striking about the data stratification is how polarized the groupings are. This means the ranking distribution separation between relevant and irrelevant is very well formed. There is an extremely small number of documents where the AI is unsure of classification. The slow curving shape of irrelevant probability on the left (the bottom of my upside-down champagne glass) is gone.

The visualization shows a much clearer and more complete ranking at work. The AI is much more certain about which documents are irrelevant. To the right is a screenshot of the table-form display of this same project in 5% increments. It shows the exact numbers of the probability distribution in place when the machine training was completed. This is the most pronounced polar separation I have ever seen, which shows that my training on relevancy was well understood by the machine.
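A table like that 5% increment display can be reproduced with a few lines of Python. This is only a sketch with made-up scores, not the actual EDR display.

```python
from collections import Counter

def probability_table(scores, step=5):
    """Bucket probable-relevance scores (0 to just under 100) into
    step-percent bins, like a 5% increment ranking display."""
    bins = Counter(int(s // step) * step for s in scores)
    return {f"{b}-{b + step}%": bins.get(b, 0) for b in range(0, 100, step)}

# Made-up polarized distribution: most docs near 0% or near 100%,
# with a lone straggler in the unsure middle
scores = [1, 2, 3, 2, 1, 97, 98, 99, 55]
for bucket, count in probability_table(scores).items():
    if count:
        print(bucket, count)
```

In a well-trained project, almost all the counts pile up in the first and last bins, with near-empty bins in between: the upside-down T.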

After you have segregated the document collection into two groups, and gone as far as you can, or as far as your budget allows, you then cull out the probable irrelevant. The most logical place for the Second Filter cut-off point in most projects is at 49.9% probable relevant and below: the documents that are more likely than not to be irrelevant. But do not take the 50%-plus dividing line as an absolute rule in every case. There are no hard and fast rules to predictive culling. In some cases you may have to set the cut-off as high as 90% probable relevant. Much depends on the overall distribution of the rankings and the proportionality constraints of the case. Like I said before, if you are looking for Gilbert's black-letter law solutions to legal search, you are in the wrong type of law.
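In code terms, the Second Filter cut-off is just a threshold split over the ranked collection. A minimal sketch, with a hypothetical function name and the default set at the 50% "more likely than not" line:

```python
def second_filter_cull(ranked_docs, cutoff=0.50):
    """Split ranked docs at a probable-relevance cutoff.

    ranked_docs: dict of doc_id -> probability of relevance (0.0-1.0)
    Returns (review_pool, culled). The 0.50 default is the "more likely
    than not" line, but the right cutoff is case-specific and can run
    as high as 0.90 when proportionality demands it.
    """
    review_pool = {d: p for d, p in ranked_docs.items() if p >= cutoff}
    culled = {d: p for d, p in ranked_docs.items() if p < cutoff}
    return review_pool, culled
```

Raising the `cutoff` parameter shrinks the review pool, which is exactly the lever proportionality arguments turn on.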

Almost all of the documents in the production set (the red top half of the diagram) will be reviewed by a lawyer or paralegal. Of course, there are shortcuts to that too, like duplicate and near-duplicate syncing. Some of the low-ranked, probable irrelevant documents will have been reviewed too. That is all part of the CAL process, where both relevant and irrelevant documents are used in training. If all goes well, however, only a few of the very low percentage probable relevant documents will be reviewed.

To be continued ….

Kulling With or Without Robots: Second Stage Predictive Coding Culling

This is part Eleven of the continuing series on two-filter document culling. (Yes, we are going for a world record on longest law blog series. :) Document culling is very important to successful, economical document review. Please read parts one, two, three, four, five, six, seven, eight, nine and ten before this one.

Irrelevant Training Documents Are Important Too

In the second filter you are on a search for the gold, the highly relevant, and, to a lesser extent, the strong and merely relevant. As part of this Second Filter search you will naturally come upon many irrelevant documents too. Some of these documents should also be added to your predictive coding training. (That is the smart robot part of your document review software with active machine learning.) In fact, it is not uncommon to have more irrelevant documents in training than relevant, especially with low prevalence collections. If you judge a document, then go ahead and code it and let the computer know your judgment. That is how it learns. There are some documents that you judge but may not want to train on, such as the very large or the very odd, but they are few and far between.

Of course, if you have culled out a document altogether in the First Filter, you do not need to code it, because it will not be among the documents included in the Second Filter. In other words, it will not be among the documents ranked in predictive coding. These culled documents will either be excluded from possible production altogether as irrelevant, or will be diverted to a non-predictive coding track for final determinations. The latter is the case for non-text file types like graphics and audio in cases where they might have relevant information.

How To Do Second Filter Culling Without Predictive Ranking

When you have software with active machine learning "smart robot" features that allow you to do predictive ranking, then once you have found documents for training, you can incorporate predictive ranking searches into your review from that point forward. If you do not have such features, you can still sort out documents in the Second Filter for manual review, but you cannot use ranking with SAL and CAL to do so. Instead, you have to rely on keyword selections, enhanced with concept searches and similarity searches.

When you find an effective parametric Boolean keyword combination, which you develop through a process of party negotiation, testing, educated guessing, trial and error, and judgmental sampling, you then submit the documents containing proven hits to full manual review. Ranking by keyword counts can also be tried for document batching, but be careful: large files can rack up many keyword hits on the basis of file size alone, not relevance. Some software compensates for that, but most does not. So ranking by keywords can be a risky process.
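One common way software compensates for the file-size problem is to rank by keyword density, hits per thousand words, rather than raw hit count. A rough sketch (my own illustration, not any vendor's actual algorithm):

```python
def keyword_density(text, keywords):
    """Keyword hits per 1,000 words. Ranking on density rather than raw
    hit count keeps big files from floating to the top on size alone."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(words.count(k.lower()) for k in keywords)
    return 1000.0 * hits / len(words)

small = "fraud is bad"                 # 1 hit in 3 words
big = "filler " * 998 + "fraud fraud"  # 2 hits in 1,000 words
print(keyword_density(small, ["fraud"]) > keyword_density(big, ["fraud"]))  # prints True
```

On raw counts the big file would outrank the small one two hits to one; on density the ranking flips, which is usually closer to what relevance looks like.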

I am not going to go into detail on the old-fashioned ways of batching out documents for manual review. Most e-discovery lawyers already have a good idea of how to do that. So too do most vendors. Just one word of advice: when you start manual review based on keyword or other non-predictive coding processes, check the contract reviewers' work daily and calculate the precision that the various keyword and other assignment folders are producing. If it is terrible, which I would say is anything less than 50% precision, then try to improve the selection matrix. Change the Boolean logic, or the keywords, or something. Do not just keep plodding ahead and wasting the client's money.
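That daily precision check can be automated with a simple tally over the reviewers' coding log. A hypothetical sketch, with the 50% floor suggested above; the folder names and log format are made up for illustration:

```python
def folder_precision_report(review_log, floor=0.50):
    """review_log: iterable of (folder, coded_relevant) pairs from the
    day's contract-reviewer coding. Flags folders whose precision has
    fallen below the floor so the selection matrix can be revised."""
    stats = {}
    for folder, relevant in review_log:
        total, hits = stats.get(folder, (0, 0))
        stats[folder] = (total + 1, hits + (1 if relevant else 0))
    return {
        folder: (hits / total, "revise keywords" if hits / total < floor else "ok")
        for folder, (total, hits) in stats.items()
    }

day1 = [("keyset-A", True), ("keyset-A", False), ("keyset-A", False),
        ("keyset-B", True), ("keyset-B", True)]
for folder, (precision, verdict) in folder_precision_report(day1).items():
    print(folder, f"{precision:.0%}", verdict)
```

Run against each day's log, a report like this shows at a glance which keyword folders are burning review dollars.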

I once took over a review project that was using negotiated, then tested and modified, keywords. After two days of manual review we realized that only 2% of the documents selected for review by this method were relevant. After I came in and spent three days of training to add predictive ranking, we were able to increase that to 80% precision. If you use these multimodal methods, you can expect similar results.

To be continued …

Kulling With Three Kinds of Predictive Coding Ranking Methods

This is part Ten of the continuing series on two-filter document culling. This is very important to successful, economical document review. Please read parts one, two, three, four, five, six, seven, eight and nine before this one.

Three Kinds of Second Filter Probability Based Search Engines

After the first round of training (really, after the first document is coded in software with continuous active training), you can begin to harness the AI features in your software. You can begin to use its probability ranking to find relevant documents. There are currently three kinds of ranking search and review strategies in use: uncertainty, high probability, and random. (I use all three kinds, along with other non-predictive coding searches. This combined approach can be considered a fourth, multimodal method.)

The uncertainty search, sometimes called SAL for Simple Active Learning, looks at middle-ranking documents where the AI is unsure of relevance, typically the 40%-60% range. The high probability search looks at documents where the AI thinks it knows whether they are relevant or irrelevant. You can also use some random searches, if you want, both simple and judgmental; just be careful not to rely too much on chance.
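The three selection strategies reduce to three different sort orders over the same ranked collection. A toy sketch (my own labels and function name, not any vendor's API):

```python
import random

def select_batch(ranked, strategy, n=5, seed=7):
    """ranked: dict of doc_id -> probability of relevance (0.0-1.0).

    'uncertainty'      (SAL-style): docs nearest 0.5, where the AI is unsure
    'high_probability' (CAL-style): the top-ranked docs
    'random'           (SPL-style): selection by chance alone
    """
    ids = list(ranked)
    if strategy == "uncertainty":
        ids.sort(key=lambda d: abs(ranked[d] - 0.5))
    elif strategy == "high_probability":
        ids.sort(key=lambda d: ranked[d], reverse=True)
    elif strategy == "random":
        random.Random(seed).shuffle(ids)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ids[:n]

ranked = {"a": 0.99, "b": 0.55, "c": 0.45, "d": 0.05}
print(select_batch(ranked, "high_probability", n=1))  # prints ['a']
```

Same collection, three very different review batches; the multimodal approach mixes them rather than betting everything on one.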

The 2014 Cormack Grossman comparative study of various methods has shown that the high probability search, which they called CAL, for Continuous Active Learning using high-ranking documents, is very effective. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR'14, July 6-11, 2014. Also see: Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part Two.

My own experience also confirms their experiments. High probability searches usually involve SME training and review of the upper strata, the documents with a 90% or higher probability of relevance. The exact percentage depends on the number of documents involved. I may also check out the low strata, but will not spend very much time on that end. I like to use both uncertainty and high probability searches, but typically with a strong emphasis on the high probability searches. And again, I supplement these ranking searches with other multimodal methods, especially when I encounter strong, new, or highly relevant types of documents.

Sometimes I will even use a little random sampling, but the Cormack Grossman study mentioned above shows that it is not effective, especially on its own. They call such chance-based search Simple Passive Learning, or SPL. Ever since reading the Cormack Grossman study I have cut back on my reliance on any random searches. You should too. It was small before; it is even smaller now. This does not mean sampling does not still have a place in document review. It does, but in quality control, not in selection of training documents. See, e.g., ZeroErrorNumerics.com and Introducing "ei-Recall" – A New Gold Standard for Recall Calculations in Legal Search.
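As a taste of how sampling earns its keep in quality control, here is the simplest possible elusion-based recall point estimate: sample the discard pile at random, project the miss rate, and compare against what was produced. To be clear, this is NOT the ei-Recall method referenced above, which uses interval estimates; this is only a crude sketch with made-up numbers.

```python
def recall_point_estimate(sample_size, relevant_in_sample,
                          discard_pile_size, produced_relevant):
    """Crude elusion-based recall point estimate. Sample the discard
    pile at random, project the rate of missed relevant documents over
    the whole pile, then compute recall = TP / (TP + projected FN)."""
    elusion = relevant_in_sample / sample_size
    projected_misses = elusion * discard_pile_size
    return produced_relevant / (produced_relevant + projected_misses)

# Made-up numbers: 2 relevant docs found in a 1,000-doc random sample
# of a 100,000-doc discard pile, against 9,000 produced relevant docs
print(f"{recall_point_estimate(1000, 2, 100000, 9000):.1%}")  # prints 97.8%
```

A point estimate like this is only a starting place; the whole reason for ei-Recall and similar methods is that a single sample can mislead, so intervals matter.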


Kulling Robots: Fine-Grained Second Filter Culling by Use of Predictive Coding

This is part Nine of the continuing series on two-filter document culling. This is very important to successful, economical document review. Please read parts one, two, three, four, five, six, seven and eight before this one.

Second Filter – Predictive Culling and Coding

The second filter begins where the first leaves off. The ESI has already been purged of unwanted custodians, date ranges, spam, and other obviously irrelevant files and file types. Think of the First Filter as a rough, coarse filter, and the Second Filter as fine-grained. The Second Filter requires a much deeper dive into file contents to cull out irrelevance. The most effective way to do that is to use predictive coding, by which I mean active machine learning, supplemented somewhat by a variety of other methods to find good training documents. That is what I call a multimodal approach, one that places primary reliance on the Artificial Intelligence at the top of the search pyramid. If you do not have the active machine learning type of predictive coding with ranking abilities, you can still do fine-grained Second Level filtering, but it will be harder, and probably less effective and more expensive.

Pyramid Search diagram

All kinds of Second Filter search methods should be used to find highly relevant and relevant documents for AI training. Stay away from any process that uses just one search method, even if the one method is predictive ranking. Stay far away if the one method is rolling dice. Reliance on random chance alone has been proven to be an inefficient and ineffective way to select training documents. Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One, Part Two, Part Three and Part Four. No one should be surprised by that.

The first round of training begins with the documents reviewed and coded relevant incidental to the First Filter coding. You could also defer the first round until you have done more active searches for relevant and highly relevant documents from the pool remaining after First Filter culling. In that case you also include irrelevant documents in the first training round, which is also important. Note that even though the first round of training is the only round of training that has a special name – seed set – there is nothing all that important or special about it. All rounds of training are important.

There is so much misunderstanding about that, and about seed sets generally, that I no longer like to even use the term. The only thing special in my mind about the first round of training is that it is sometimes a very large training set. That happens when the First Filter turns up a large number of relevant files, or when they are otherwise known and coded before the Second Filter training begins. The sheer volume of training documents in many first rounds thus makes them special, not the fact that the training came first.

No good predictive coding software is going to give special significance to a training document just because it came first in time. (It might if it uses a control set, but that is a different story, explained in my article Predictive Coding 3.0.) The software I use has no trouble at all disregarding any early training if it later finds that it is inconsistent with the total training input. It is, admittedly, somewhat aggravating to have a machine tell you that your earlier coding was wrong. But I would rather have an emotionless machine tell me that than another gloating attorney (or judge), especially when the computer is correct, which is often (not always) the case.

That is, after all, the whole point of using good software with artificial intelligence. You do that to enhance your own abilities. There is no way I could attain the level of recall I have been able to manage lately in large document review projects by relying on my own limited intelligence alone. That is another one of my search and review secrets. Get help from a higher intelligence, even if you have to create it yourself by following proper training protocols.