The body of scientific knowledge is not only growing, it’s accelerating. More and more papers are published every day, at an astonishing rate—estimated at about 7000 per day.[1] And yet, by some early estimates, only 1% of studies in the biomedical literature meet the minimum criteria for scientific quality[2].

So when a physician tries to update herself on the best therapies for a given disease, she not only has to slog through all of those papers, but also home in on the “good” ones that really provide pivotal results and avoid the research that wasn’t done rigorously.

Clearly, we need a way to filter the good research from the not-so-good, and to do so as automatically as possible (without losing any good results), so that the doctor can find the best research with the least amount of effort.

A group of us from Evid Science, the University of Utah and McMaster University recently proposed a machine learning approach to this problem in our paper, “A Deep Learning Method to Automatically Identify Reports of Scientifically Rigorous Clinical Research from the Biomedical Literature: Comparative Analytic Study.”

The original paper was meant for biomedical informatics professionals, but the topic is relevant to data scientists and aspiring data scientists alike. Here we’ll summarize what we did and why we did it, and perhaps even inspire the next generation of work to classify the scientific soundness of medical papers.

What’s Already Been Done?

Given the prevalence of the problem, people have tried to address it before. However, many of the previous attempts rely on hand-crafted features, such as combinations of bibliographic measures, or rely on manually entered features, such as MeSH[1] index terms.

The most popular, state-of-the-art approach is the set of Clinical Query filters provided within PubMed.[2] Clinical Query filters are combinations of text-words and MeSH terms[3] that can return a higher proportion of rigorous papers than just PubMed searching alone. And while they are in standard usage currently, and do a good job, they fundamentally rely on human-provided MeSH terms, which annotators can take anywhere from a week to almost a year after publication to add. That is a long time to wait for an assessment! (On top of that, some of the Clinical Query filters leave room for improvement.)

Other approaches do use machine learning[3-5]. However, they use particular hand-crafted features (e.g., variants on MeSH terms, UMLS concepts, etc.) and other features such as bibliographic measures, which themselves may be proprietary data to a company (and therefore are not freely available). Which is all to say, they use features that go beyond the text and therefore present the problems of availability (e.g., can you even access bibliographic data for every paper?) and over-fitting (e.g., how can you be sure these features will generalize to other cases?).

However, every paper has an abstract and a title—or at least it should. So if you could just use those as features, that would make an approach as widely applicable (and hopefully useful) as possible. And that is the main point of our paper – we don’t use anything else.

Instead, we describe an “end-to-end” deep learning system, which turns the deluge of papers into an advantage. Our approach simply uses titles and abstracts, learning which words and phrases are indicative of rigor; given enough (even noisy) data, it can find enough signal to make good decisions about which papers are sound and which are not.

So How Does It Work?

At its core, the problem is one of classification. Is this paper rigorous or not? Or, to make the decision more flexible: what is the probability that this paper is rigorous? (This lets us explore operating thresholds, etc.)

As a classification problem, the input is a title and abstract, concatenated together into a single blob of text. And the output is a probability that the article is scientifically rigorous.
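To make the idea of an operating threshold concrete, here is a minimal sketch in plain Python. It is not the code from the paper: the helper names (`build_input`, `precision_recall`) and the scores and labels are made up for illustration. It shows how one set of probabilities yields different precision and recall depending on where the cutoff is placed.

```python
def build_input(title, abstract):
    """Concatenate title and abstract into the single text blob the classifier sees."""
    return f"{title} {abstract}".strip()


def precision_recall(probabilities, labels, threshold):
    """Precision and recall when every article scoring >= threshold is called rigorous."""
    predicted = [p >= threshold for p in probabilities]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and not y)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical model outputs and ground-truth labels (True = rigorous).
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.05]
truth = [True, True, False, True, False, False]

for threshold in (0.3, 0.5, 0.9):
    precision, recall = precision_recall(scores, truth, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

A low threshold favors recall (nothing good is missed, but more noise gets through), while a high threshold favors precision. That trade-off is what the three use cases later in this article explore.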

A popular approach to solve such text classification problems is to use a Convolutional Neural Network (CNN). More specifically, we use what is commonly called the “Kim model.”[6]

The CNN model is shown in Figure 1 below. Instead of using a string of characters to represent each word in its input, the CNN uses a multi-dimensional vector, called an embedding. Embeddings are useful because they can encode semantic information about a word (they are learned from a separate model, such as "Word2Vec" which is pre-trained on millions of documents and billions of words).

The CNN uses "sliding windows" of various sizes to transform the input text into a single compact vector. In the picture below, the word embeddings are represented as having 6 dimensions, with 3-word windows (green), and 2-word windows (red). Each window slides over the text, and outputs a number for every position. You can think of a window as recognizing a "feature" of the input sequence, with its output number indicating how strongly that feature is present. For example, a feature might indicate that a word related to experiment is followed by a word related to success, and would only output a high number for text positions where that is true.
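The sliding-window idea can be sketched in a few lines of plain Python. This is a toy illustration, not the trained model: the two-dimensional embeddings (one dimension for "experiment-ness," one for "success-ness"), the hand-written filter, and the use of a ReLU with no bias term are all assumptions made for the example. In the real model the embeddings and filters are learned.

```python
def window_scores(embeddings, window_filter):
    """Slide one filter over the text and return one score per window position."""
    width = len(window_filter)  # number of words the window covers
    scores = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        # Dot product between the filter and the word vectors under the window.
        activation = sum(
            weight * value
            for filter_row, word_vector in zip(window_filter, window)
            for weight, value in zip(filter_row, word_vector)
        )
        scores.append(max(0.0, activation))  # ReLU (assumed nonlinearity)
    return scores


# Toy embeddings: [experiment-ness, success-ness]. Purely illustrative values.
toy_embeddings = {
    "the": [0.0, 0.0],
    "trial": [1.0, 0.0],
    "succeeded": [0.0, 1.0],
    "in": [0.0, 0.0],
    "patients": [0.0, 0.0],
}
text = "the trial succeeded in patients".split()
embedded = [toy_embeddings[word] for word in text]

# A 2-word window that fires when an experiment word is followed by a success word.
experiment_then_success = [[1.0, 0.0], [0.0, 1.0]]

scores = window_scores(embedded, experiment_then_success)
print(scores)       # [0.0, 2.0, 0.0, 0.0]
print(max(scores))  # 2.0
```

The filter only produces a high number at the position covering "trial succeeded," which is the behavior described above.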

The CNN has 300 of these features (learned during training). So by sliding each window over the text, and recording its maximum value, the CNN can transform the entire input text into a single 300-dimensional "feature vector," representing which features are most present in the text. Each feature is then scaled according to how important it is for the desired classification, and then summed with all the other features to produce a final output classifier score (for the more experienced: the vector is classified using a simple linear model and softmax layer).
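The pooling and classification steps can be sketched the same way. Again, this is an illustration under stated assumptions rather than the paper's implementation: it uses three features instead of 300, and the per-position scores, class weights, and biases are invented values (in the real model they are learned by backpropagation).

```python
import math


def max_over_time(per_filter_scores):
    """Keep only the strongest response of each filter, giving one number per feature."""
    return [max(scores) for scores in per_filter_scores]


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_softmax(feature_vector, class_weights, class_biases):
    """Weight each feature per class, sum, and normalize with softmax."""
    logits = [
        sum(w * f for w, f in zip(weights, feature_vector)) + bias
        for weights, bias in zip(class_weights, class_biases)
    ]
    return softmax(logits)


# Per-position outputs of three hypothetical filters (the real model has 300).
per_filter_scores = [
    [0.0, 2.0, 0.0, 0.0],  # a 2-word window over a 5-word text
    [0.5, 0.1, 0.0, 0.3],  # another 2-word window
    [0.0, 0.0, 1.5],       # a 3-word window
]
features = max_over_time(per_filter_scores)  # [2.0, 0.5, 1.5]

# Illustrative weights: row 0 scores "not rigorous", row 1 scores "rigorous".
class_weights = [[-1.0, 0.5, 0.0], [1.0, -0.5, 0.5]]
class_biases = [0.0, 0.0]

p_not_rigorous, p_rigorous = linear_softmax(features, class_weights, class_biases)
print(f"P(rigorous) = {p_rigorous:.3f}")
```

Because only the maximum of each filter is kept, the feature vector has the same length no matter how long the abstract is.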


Figure 1: A CNN for text classification

While we are not the first to take this approach, we did find a novel way to train our model. In particular, a CNN such as this requires substantial training data. In order to learn useful features (via the deep learning "backpropagation" algorithm), we needed (many) examples of articles that were deemed scientifically sound and those that were not. Fortunately, these CNN approaches are robust to noise – that means that if we could find enough training data, even if it had some noise (e.g., misclassified examples), the model should be able to pick out enough signal to perform well.

To create our training data, we used the PubMed API and built two collections of articles (again, just getting their title and abstract). As we mentioned, there is already a Clinical Query called a “Narrow filter” for therapy articles. It relies upon MeSH terms and text-words, but it is able to identify articles that are scientifically sound some of the time.
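For readers who want to try something similar, the sketch below shows one way a search request for the Narrow therapy filter could be built against the public NCBI E-utilities `esearch` endpoint. This is an assumption-laden illustration, not the calls used for the paper: the `Therapy/Narrow[filter]` search term, the `retmax` value, and the helper name `build_search_url` are all choices made here. The code only builds the URL and does not contact PubMed.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, max_results):
    """Build an esearch URL that would return up to max_results PubMed IDs for a query."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Assumed query term for the Narrow therapy Clinical Query filter.
positive_url = build_search_url("Therapy/Narrow[filter]", 100)
print(positive_url)
```

The returned IDs would then need a second request (for example, to the `efetch` endpoint) to retrieve each title and abstract.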


Using these calls and filters, we limited ourselves to gathering 150,000 articles for the “positive” set and 300,000 for the “negative” set. After removing articles with no abstract, we ended up with 147,182 positive and 256,034 negative articles. We then set aside 90% of the articles for training, and 10% for “development” (e.g., fine-tuning the model).
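The 90/10 split itself is straightforward. The sketch below assumes a simple shuffled split with a fixed random seed; whether the paper's split was random or stratified is not stated here, and the function name `train_dev_split` and the toy article tuples are invented for illustration.

```python
import random


def train_dev_split(articles, dev_fraction=0.1, seed=0):
    """Shuffle the articles and hold out dev_fraction of them for development."""
    shuffled = list(articles)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    dev_size = int(len(shuffled) * dev_fraction)
    return shuffled[dev_size:], shuffled[:dev_size]


# Toy stand-ins for (text, is_positive) pairs.
articles = [(f"article-{i}", i % 3 == 0) for i in range(1000)]
train, dev = train_dev_split(articles)
print(len(train), len(dev))  # 900 100
```

Keeping the development set separate from the training set is what allows model settings to be tuned without measuring performance on data the model has already seen.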

So our real innovation in the paper was finding the right model to apply to the problem and then figuring out how to train it.

Does It Work?

Once we picked the right model and trained it, our next challenge was to see if it really worked. We could have tested against our PubMed API data, but as mentioned, the Clinical Queries we used to build the training data are noisy. They sometimes flag a rigorous article as not, and vice versa, so that data was unreliable for testing. Therefore, we needed a clean data set to test against.

Fortunately, our colleagues at McMaster provide just such a data set, called Clinical Hedges. The Hedges data set consists of 50,594 articles published in 170 journals, which were manually classified as being scientifically sound or not (with very high fidelity). Within the Hedges data, we selected the 1,524 scientifically sound studies that focused on therapies (not in our training data), and 29,144 non-sound articles (also not in our training data). While these weren’t enough articles to train our system, they were certainly enough for evaluating (also, they were used to evaluate the Clinical Queries when they were developed, providing a good benchmark).

Given this test data, we then constructed three different test scenarios, mirroring three different use cases of the medical literature.

Use-Case 1: Developing Evidence-Based Syntheses of the Medical Literature

In this scenario, the goal is to retrieve every possible paper (since otherwise a synthesis could be considered incomplete), but filter out as many non-rigorous papers as possible (since a synthesis wouldn’t consider any paper that isn’t rigorous science). In terms of metrics, we want recall to be as close to perfect as possible (i.e., we retrieve nearly all of the truly sound articles) while maximizing precision (i.e., as few non-rigorous papers as possible among those retrieved). So we compared our CNN model to the Clinical Queries Broad filter, which was tuned for recall. While Broad does have higher recall (98.4% vs. 96.9%), the CNN model dramatically outperforms it in precision (34.6% for CNN vs. 22.4% for Broad, an absolute difference of 12.2 percentage points!). This is roughly a 50% relative improvement and means you only need to filter through about half as many non-rigorous papers by hand using the CNN versus the Broad filter.
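The "half as many" claim follows directly from the two precision figures. As a quick check, the sketch below computes the relative improvement and the number of non-rigorous papers a reader must wade through per rigorous paper retrieved. The helper name is made up, and the calculation ignores the small difference in recall between the two filters.

```python
def non_rigorous_per_rigorous(precision):
    """How many non-rigorous papers are retrieved for each rigorous one."""
    return (1.0 - precision) / precision


cnn_precision = 0.346    # reported for the CNN
broad_precision = 0.224  # reported for the Broad filter

relative_gain = cnn_precision / broad_precision - 1.0
workload_ratio = (
    non_rigorous_per_rigorous(cnn_precision)
    / non_rigorous_per_rigorous(broad_precision)
)

print(f"Relative precision gain: {relative_gain:.0%}")    # 54%
print(f"CNN noise per good paper: {non_rigorous_per_rigorous(cnn_precision):.2f}")      # 1.89
print(f"Broad noise per good paper: {non_rigorous_per_rigorous(broad_precision):.2f}")  # 3.46
print(f"Workload ratio (CNN / Broad): {workload_ratio:.2f}")  # 0.55
```

In other words, for every rigorous paper, a reader using the Broad filter sifts through about 3.5 non-rigorous ones, versus about 1.9 with the CNN.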

Use-Case 2: Literature Surveillance

Physicians and scientists often track specific topics in the literature, such as particular treatments or disease updates. This is known as literature surveillance, and the idea is to get updated about an important new result in as timely a fashion as possible. However, the need remains to filter the scientifically sound articles from those that are not. And this poses a challenge because the PubMed Clinical Queries rely on the human-annotated MeSH terms, which, as we mentioned, can take a long time to appear. (Other models, which rely on bibliographic metrics, for example, are also problematic because such metrics take time to develop). In this case, a model like ours becomes quite important since it only relies on features of the text itself.

For this scenario, we compared our CNN to a “text-words only” filter developed by McMaster for just such a circumstance. And here, our CNN had equivalent recall to McMaster’s text-word filter (i.e., it returns the same number of good articles), but significantly higher precision (+22.2%, which means it filtered out many more non-rigorous ones).

Use-Case 3: Patient-care decision making

For patient-care decision making, the goal is to balance recall and precision. Ideally, you maximize both at the cost of neither, so you get as many rigorous articles as you can while letting through as few non-rigorous ones as possible. The most common filter used here is the Clinical Queries Balanced filter. In this scenario, our CNN had similar recall, but lower precision (-6.3%) than McMaster’s Balanced filter. So there is still some work to do for this particular scenario.


Overall, our model provided four important benefits:

1. It only looks at abstracts and titles, so it doesn’t rely on features that might be unavailable due to access (e.g., you need a subscription) or timeliness (e.g., no MeSH terms yet or no bibliographic metrics yet).

2. Since it provides a probability of soundness, it can rank papers according to this probability. For instance, the precision of the 50 highest-ranking articles was 70%! (This is particularly important if you consider that most physicians don’t look past the first 20 citations returned in PubMed).

3. We showed that even with a very noisy training set (we estimated roughly 50% false-positives in our training) we still built a reasonable model.

4. Because it is learned from examples, it can be improved by finding additional high quality examples, which is often easier than creating complex search filters.
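The ranking benefit in point 2 is usually measured as "precision at k": sort the articles by predicted probability and check how many of the top k are truly rigorous. Here is a minimal sketch with invented scores and labels; the function name and data are illustrative only.

```python
def precision_at_k(scored_articles, k):
    """Fraction of the k highest-scoring articles that are truly rigorous."""
    ranked = sorted(scored_articles, key=lambda item: item[0], reverse=True)
    top_k = ranked[:k]
    if not top_k:
        return 0.0
    return sum(1 for _, is_rigorous in top_k if is_rigorous) / len(top_k)


# Hypothetical (probability, is_rigorous) pairs.
scored = [
    (0.91, True),
    (0.15, False),
    (0.78, True),
    (0.66, False),
    (0.42, True),
    (0.08, False),
]

print(f"Precision at 3: {precision_at_k(scored, 3):.2f}")  # 0.67
```

A filter that only answers yes or no cannot be evaluated this way, since it gives no ordering among the articles it returns.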

What’s most clear after this work is the fact that this problem can actually be solved. With all of the amazing (and speedy) progress in machine learning, it’s possible for us to actually get to the day when someone can say, “Remember when we had to sort through these papers by hand? Researchers today have it so easy…”

Happy data mining.

Acknowledgements: The authors wish to thank Mike Ross for his helpful comments and edits, which contributed to making this article as readable as possible.


  1. "STM Report 2015 Final 2015-02-20 - International Association of STM ...." 20 Feb. 2015, Accessed 10 Jul. 2018.
  2. Haynes RB. Where's the meat in clinical journals? ACP Journal club 1993;119(3):A22.
  3. Kilicoglu H, Demner-Fushman D, Rindflesch TC, Wilczynski NL, Haynes RB. Towards automatic recognition of scientifically rigorous clinical research evidence. J Am Med Inform Assoc 2009;16(1):25-31
  4. Aphinyanaphongs Y, Tsamardinos I, Statnikov A, Hardin D, Aliferis CF. Text categorization models for high-quality article retrieval in internal medicine. J Am Med Inform Assoc 2005;12(2):207-216
  5. Bernstam EV, Herskovic JR, Aphinyanaphongs Y, Aliferis CF, Sriram MG, Hersh WR. Using citation data to improve retrieval from MEDLINE. J Am Med Inform Assoc 2006 Jan;13(1):96-105
  6. Kim Y. arXiv. 2014. Convolutional neural networks for sentence classification   URL:

[1] MeSH stands for Medical Subject Headings, and it is an ontology of medical related terms and their relationships


[3] Annotators for the National Library of Medicine apply MeSH terms to papers that they deem most appropriate in a process known as “indexing.”

About the Authors

Matthew Michelson
Matthew Michelson is the CEO of Evid Science, a technology company using AI to make access to medical evidence as simple and seamless as possible. He is an expert in machine learning and natural language processing, having published more than 30 peer-reviewed scientific papers. He is also an accomplished technologist, having helped design and launch data products used by customers such as the Department of Defense, Target, Dow Jones, and the largest pension fund in California.

Guilherme Del Fiol, MD, PhD
Guilherme Del Fiol is Associate Professor of Biomedical Informatics at the University of Utah. He is an expert in clinical informatics, clinical decision support (CDS) systems and health information technology standards. As one of the co-chairs of the CDS Work Group at Health Level Seven (HL7), he has led the development of interoperability standards for the implementation of CDS capabilities within electronic health record (EHR) systems. His research includes investigating ways to integrate the best available clinical evidence within clinicians’ workflow to support decisions in the care of specific patients.

Matt Michelson and Guilherme Del Fiol, MD, PhD
