A big issue in science is the difficulty of finding relevant texts to reference. When I need to locate a paper for this blog or research of my own, I generally check Google Scholar, the campus library, journals, and institutions.

However, the first two typically return thousands of results that have to be sorted through by hand, while the last two return only a handful of results, and those of limited scope. So where’s the balance between good coverage and manageable size?

Yu et al. tackle this issue in their paper, proposing a citation prediction process broken into two stages: sorting papers into “term buckets” and scoring potential references based on a suite of extended relationships.

Google Scholar. While exploring the factors that make existing search results unmanageable, Yu et al. count two million papers matching “link prediction” on Scholar. That figure far overstates the number of relevant papers, and a list that long is impossible to traverse by hand in any case.

This problem stems from the very nature of keyword-based engines like Scholar: too many papers incidentally contain both the terms “link” and “prediction” without actually being about the subject of “link prediction.”

But computing better rankings over a database as large as Google’s, regardless of the computing power behind it, takes far more time than a researcher is willing to wait.

Where’s the Balance. Yu et al. decide the best route involves first subdividing their database into “buckets” of papers that share a common set of search terms.

This technique is straightforward, but it needs a few refinements to keep behaving correctly as more and more papers are added to the database. The authors handle this by letting the buckets grow, gaining both new papers and new search terms over time.

Furthermore, terms are not selected to produce the largest number of results, but to most accurately predict which papers a researcher would cite if he or she had knowledge of the entire database.
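The paper’s exact construction isn’t reproduced here, but a minimal sketch of the bucket idea might look like the following, assuming each paper carries a set of terms. The bucket keys, sizes, and growth rule are my own illustration, not Yu et al.’s actual parameters.

```python
from collections import defaultdict
from itertools import combinations

def bucket_papers(papers, terms_per_bucket=2):
    """Group papers into buckets keyed by small sets of shared terms.

    papers: dict mapping paper_id -> set of terms.
    """
    buckets = defaultdict(set)
    for paper_id, terms in papers.items():
        # A paper joins every bucket keyed by a combination of its terms.
        for key in combinations(sorted(terms), terms_per_bucket):
            buckets[key].add(paper_id)
    return buckets

def add_paper(buckets, paper_id, terms, terms_per_bucket=2):
    """Let buckets grow as new papers (and new search terms) arrive."""
    for key in combinations(sorted(terms), terms_per_bucket):
        buckets[key].add(paper_id)

papers = {
    "p1": {"link", "prediction", "networks"},
    "p2": {"link", "prediction", "citations"},
    "p3": {"protein", "folding"},
}
buckets = bucket_papers(papers)
print(buckets[("link", "prediction")])  # {'p1', 'p2'}
```

The payoff is that a query for “link prediction” only has to look inside one bucket rather than scan the whole database.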

Scoring Potential References. Once this subdivision is complete, a computer can spend more time with a smaller set of potential leads, allowing it to perform more complicated calculations.

Yu et al. combine a handful of existing measures of relevance, such as H-Score, with a carefully selected suite of extended relationships (a rough scoring sketch follows the list), for example:

  • written by someone who co-wrote a paper with someone who co-wrote a paper with the researcher
  • contains a keyword contained by a paper that contains a keyword chosen by the researcher
  • and published in a journal that published a paper written by someone who co-wrote a paper with the researcher
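The actual features and weights aren’t given here, but a rough sketch of how a candidate reference might be scored against relationships like those above, using made-up feature names and weights rather than the paper’s own, could look like this:

```python
def coauthors_of_coauthors(author, coauthor_graph):
    """Authors reachable in two co-authorship hops (the first bullet above)."""
    one_hop = coauthor_graph.get(author, set())
    two_hop = set()
    for a in one_hop:
        two_hop |= coauthor_graph.get(a, set())
    return two_hop - {author}

def score_candidate(candidate, researcher, coauthor_graph, cooccurring_keywords, weights):
    """Weighted sum of simple relationship features for one candidate paper.

    cooccurring_keywords: dict mapping a keyword to the set of keywords that
    appear alongside it in some paper (one "keyword hop").
    """
    features = {
        # Written by someone two co-authorship hops from the researcher?
        "coauthor_2hop": any(
            a in coauthors_of_coauthors(researcher["author"], coauthor_graph)
            for a in candidate["authors"]
        ),
        # Shares a keyword with a paper that shares a keyword chosen by
        # the researcher (two keyword hops)?
        "keyword_2hop": bool(
            candidate["keywords"]
            & {kw for k in researcher["keywords"]
               for kw in cooccurring_keywords.get(k, set())}
        ),
    }
    return sum(weights[name] * float(val) for name, val in features.items())
```

The third relationship (a shared journal reached through a co-author) would be one more feature of the same shape.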

By letting the computer “learn” which relationships to value highly, Yu et al. consistently predict citations more accurately than existing methods.

Imagine Yu feeding the computer a paper it hasn’t seen yet: the machine outputs a list of potential citations, and Yu scores it on how closely the results match the paper’s actual citations. Repeating this process with more papers, a search engine could learn to compete with others on accuracy, although I assume the database used in Yu et al.’s work is far from the size of Google’s.
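As a toy version of that loop, assuming papers are identified by IDs and that a ranking is scored by how many true citations land in the top k (my simplification, not the paper’s actual metric):

```python
def recall_at_k(rank_fn, paper, candidates, true_citations, k=20):
    """Fraction of the paper's actual citations recovered in the top-k ranking."""
    ranked = sorted(candidates, key=lambda c: rank_fn(paper, c), reverse=True)
    return len(set(ranked[:k]) & true_citations) / max(len(true_citations), 1)

def average_recall(rank_fn, test_set, k=20):
    """Repeat over many unseen papers, as in the thought experiment above."""
    scores = [recall_at_k(rank_fn, paper, cands, cites, k)
              for paper, cands, cites in test_set]
    return sum(scores) / len(scores)
```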

Google Scholar, get in on this. ∎

Read more in this series and follow me on Twitter. Let’s chat sometime.