Issues in Science and Technology Librarianship
Jerry E. Gray
Michelle C. Hamilton
Margaret M. Janz
Justin P. Peters
School of Library and Information Science
In scientific and academic circles, the value of Google Scholar as an information resource has received much scrutiny. Numerous articles have examined its search ability, but few have asked whether it has the accuracy, authority and currency to be trustworthy enough for scholars. This article examines the reliability factors behind Google Scholar's citation counts, its selection of resources, and its commercial partnerships. Research data culled from correspondence with Google Scholar, analysis of citation metrics, metadata and search processes, and an appraisal of its strengths and weaknesses compared to other science-specific indexes led to the conclusion that Google Scholar may be useful for initial and supplemental information gathering, but lacks the deeper reliability that other existing services currently provide scholars. Advice is offered to science librarians about how to regard Google Scholar as a research tool.
Google Scholar (GS) as a valid and appropriate resource for scholarly information is naturally the topic of much discussion at the reference desk. In a 2005 study, 24% of North American Research Libraries included Google Scholar in their database lists and fewer than 20% of academic libraries recommended it as a search engine for research (Mullen & Hartman 2006). Librarians and information professionals have an innate professional interest in determining the scholarly nature of information resources, and science librarians in particular need to be aware of Google Scholar's strengths and weaknesses as a scientific information tool. Empirical data -- whether through experimentation or observation -- and their subsequent analyses serve as the foundation for scientific knowledge. Criteria and models for data collection have been laid out by masters of the field and disseminated to the scientific community through scholarly literature. The peer review system of this literature validates and supports the consensus practice by serving as the decisive body between science and pseudoscience. Science librarians have a duty to inform users on the importance of recognizing valid scientific resources by their adherence to standards found in scientific inquiry and reasoning. Content found through indexing services, in academic repositories, and on government web sites undergoes some level of scrutiny, either by peer review, editorial process, or selection standards, ensuring a level of accountability for the quality of what is presented. However, with the enduring popularity of search engines like Google Scholar that do not conform to a validation process, the line between scholarly science literature and pseudoscience is no longer so clear.
To better understand the methods and approaches of the company, several e-mails were sent to Google Scholar. The authors requested information regarding the inclusion guidelines by which papers are made searchable and other aspects of Google Scholar's practices. Attempts to communicate with the company yielded little information beyond what is currently available on the About Google Scholar web pages. No additional information was provided regarding which sources are crawled for indexing, how citation information is gathered, or what, if any, business partnerships exist between Google Scholar and publishers. Google is a private business and as such it is their prerogative whether to disclose this information. However, information professionals expect a certain amount of accountability from any organization claiming to offer access to scholarly resources.
Science librarians are skilled in making distinctions between scholarly information and that which may be popular, irrelevant, or unreliable. They understand that the nature of science is crucial because it is "intrinsically linked to the retrieval and evaluation of scientific information" (Lascar & Mendelsohn 2011a). Accuracy, authenticity, and currency are three of the most important qualities of scholarly scientific literature, as upheld by the McLean v Arkansas Documentation Project 2005, which describes science with the following qualifications: "It is guided and explained by natural law, it is testable against the empirical world, conclusions are tentative, and it is falsifiable" (Lascar & Mendelsohn 2011a). The validity of scholarly resources lies in the understanding of the historical perspective in the evaluation of science and recognition of appropriateness when used to support scientific arguments (Lascar & Mendelsohn 2011b). When judging whether a piece of science literature is "scholarly", the content of the article supersedes the place where it resides and its description in metadata.
As Google Scholar would not respond to e-mail requests for their definition of "scholarly", the authors formed one using a combination of statements from the About Google Scholar web pages. It can be assumed that the crawler used to locate scholarly information simply uses the metadata and the electronic location (universities, repositories, journals, "other websites") to determine the scholarly nature of electronic documents (About Google Scholar [updated 2012]). Presumably, if a paper on a personal web site has the correct metadata it can be included, whether scholarly or not. Google Scholar does not differentiate between science and pseudoscience, nor does it differentiate between the sciences and arts and humanities. It seems to characterize scholarly information by its location and associated metadata and, at the same time, it cannot control the quality of papers located while using their product.
Much of the research published to date regarding Google Scholar has focused on the comparison of Google Scholar to a variety of different science databases. Comparisons between PubMed, BIOSIS Previews, SciFinder, Web of Science, and others have been done fairly comprehensively. The data collected have sufficiently covered the efficacy or shortcomings of Google Scholar as a search device when contrasted with existing engines. What has not been examined is the ethos behind the popular search engine. Since academic and scientific circles have an expectation for transparency and access to scholarly information, research questions that should be explored include:
Google Scholar was not forthcoming with this information during the research process.
As Mayr and Walter state in "Studying Journal Coverage in Google Scholar" (2008), "Google Scholar is a freely available service with a familiar interface similar to Google Web Search." The results retrieved from a Google Scholar search are an important aspect of the research tool to be evaluated and considered. Previous studies state, and the authors' efforts confirm, that Google does not reveal specific sources used to create its Google Scholar index (Mayr & Walter 2008; Harzing 2010; Adriaanse & Rensleigh 2011). The Google Scholar Help page states that "Google Scholar includes journal and conference papers, theses and dissertations, academic books, pre-prints, abstracts, technical reports and other scholarly literature from all broad areas of research" (Scholar Help [updated 2012]). It also says that it includes "works from a wide variety of academic publishers, professional societies and university repositories, as well as scholarly articles available anywhere across the web...court opinions and patents." Repeated e-mails to Google Scholar requesting a list of journals and other sources that were crawled were met with no clear answers, only links to Google Scholar help documentation and the About Google Scholar pages. While the available information is helpful, the purpose of contacting Google Scholar was to obtain more in-depth and precise information regarding the search engine and the sources it presents in results, something that Google Scholar was unable or unwilling to provide. As of writing, Google Scholar has added a list entitled "Top publications in English" as part of the Google Scholar Metrics page (Top Publications [updated 2012]), but as the authors discovered in their e-mails to Google Scholar, a complete list either is not available or would not be provided. Requests for information regarding ranking criteria were also answered unsatisfactorily.
This unwillingness to provide concrete answers about the search results creates concerns about how comprehensive Google Scholar search results are and may leave an already skeptical academic audience to fear the possibility of results manipulation.
One reason Google Scholar gave for not providing further assistance or a full list of source material was that it is a free service. When asked, "Are there standard sources for these articles that are crawled on a regular basis? If so, what are these sources?" (Hauser 2012), Google Scholar responded, "This is covered in the help pages, and most of it is obvious if you do a few searches. Sorry, we aren't able to assist you in great detail, it's a free service" (Google Scholar Team 2012). The attitude from Google Scholar seems to be that users should just be happy that this service is available for free and that users are not entitled to direct answers regarding its sources. PubMed is another freely available resource, provided by the National Center for Biotechnology Information and the US National Library of Medicine. PubMed is funded by tax revenue and makes a full list of indexed journals readily available (Journal List [updated 2012]) out of responsibility to its investors: taxpayers. Arguably, the ad revenue that Google as a company relies on is provided by consumers of its products, and consequently Google ought to reconsider this apparently cavalier "love it or leave it" philosophy.
Google Scholar makes an interesting statement regarding the coverage of publications, which again raises concerns about its lack of any meaningful source description. On the Google Scholar Metrics page they write, "since Google Scholar indexes articles from a large number of websites, we cannot always tell where (or if!) a particular article has been published" (Google Scholar Metrics [updated 2012]). This simple sentence is a major red flag. Quality scholarly material is that which has been vetted in some way to ensure quality, accuracy and authority. While much of what a search in Google Scholar retrieves is scholarly and relevant, there is still a chance that a search will return some material that does not meet the authority and quality expected by scientists and researchers. Google Scholar attempts to minimize errors in identifying publications by controlling which types of documents are included for the metrics algorithms, but the process is still flawed. If this tool cannot tell where or if an article has been published, how can one know if the material found in Google Scholar is accurate, valid or carries any authority in a field of research?
Google Scholar's citation counts are another factor to consider. Egghe and Rousseau (1990) argue that a high citation count not only implies use of a document but also adds credence to the work and demonstrates acceptance in a particular field of research. Citations of an article, they say, imply that references are made up of the best literature on the topic and that the content of the articles being cited is related to the topic of the citing article. Google Scholar's compilation of these citation counts is therefore important, as the counts can influence which articles researchers select and include in background coverage.
Citation counts in Google Scholar only include material published in the last five years and only include journal articles which follow Google Scholar's Inclusion Guidelines and a small number of manually identified conference papers and pre-prints. Google Scholar also states on its web site that it excludes some items, specifically, "court opinions, patents, books and dissertations; publications with fewer than 100 articles published between 2007 and 2011; and publications that received no citations to articles published between 2007 and 2011" (Google Scholar Metrics [updated 2012]). Google Scholar does not give specific reasons for its decisions to include or exclude certain items other than to "avoid misidentification of publications." These inclusion and exclusion decisions can affect citation counts, since what is included in the metrics is left entirely to Google Scholar's discretion.
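The exclusion rules quoted above can be read as a simple filter. The following is a minimal sketch of that reading, not Google Scholar's actual implementation, which has not been disclosed. The `Publication` record, its field names, and the sample values are all hypothetical and are introduced only for illustration.

```python
from dataclasses import dataclass

# Document types named as excluded in the quoted Metrics description.
EXCLUDED_TYPES = {"court opinion", "patent", "book", "dissertation"}


@dataclass
class Publication:
    """Hypothetical record; every field name here is an assumption."""
    title: str
    doc_type: str
    articles_2007_2011: int   # articles published between 2007 and 2011
    citations_2007_2011: int  # citations received by those articles


def passes_exclusion_rules(pub: Publication) -> bool:
    """Return True if none of the three quoted exclusion rules applies."""
    if pub.doc_type in EXCLUDED_TYPES:
        return False
    if pub.articles_2007_2011 < 100:
        return False
    if pub.citations_2007_2011 == 0:
        return False
    return True


# Invented sample data, for illustration only.
samples = [
    Publication("Large journal", "journal", 850, 4200),
    Publication("Small journal", "journal", 60, 310),
    Publication("Uncited journal", "journal", 140, 0),
    Publication("Monograph series", "book", 300, 900),
]
kept = [p.title for p in samples if passes_exclusion_rules(p)]
print(kept)
```

In this invented sample only the first record survives, which shows how a small but active journal can drop out of the metrics purely on the article-count threshold.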
Citation counts in Google Scholar can be affected or influenced in a variety of ways. Some publishers may choose not to allow Google Scholar to crawl their material: "ACS, as one of the dominating publishers in chemistry, does not cooperate with Google which causes a significant loss of citations and prohibits the use of GS in citation analysis" (Bornmann et al. 2009). Other publishers that allow Google Scholar to crawl their material may have a distinct advantage or disadvantage for being discovered through Google Scholar based on their resources. The inclusion guidelines, even as described by Google Scholar, are long and dense. Publishers and content owners with robust technical staff and resources are more likely to meet these technical inclusion guidelines and consequently increase the inclusion and rankings of their content on Google Scholar. Content that was published before GS's inclusion guidelines, or by smaller publishers with fewer resources, may be excluded or be ranked lower than content from larger publishers. This could lead to a loss of comprehensiveness for scholarly searches.
Although Google Scholar excludes some material in an effort to keep citation counts accurate, it still crawls items such as student handbooks, library guides, and editorial notes (Shultz 2007; Harzing 2010) for use in citation counts, so long as these documents follow Google Scholar's inclusion guidelines. These materials might not be considered scholarly or worthy of a true citation. This can lead to inflated values for the citation counts and, in turn, overstate the academic value of the article and change the way it is ranked in Google Scholar's results lists.
The Google Scholar Metrics page provides some information about how citation counts are used to rank journals (Google Scholar Metrics [updated 2012]). Currently Google Scholar metrics cover documents published between 2007 and 2011 and are "based on all articles that were indexed in Google Scholar as of April 1st 2012. This also includes citations from articles that are not themselves covered by Scholar Metrics" (Google Scholar Metrics [updated 2012]). Google Scholar also states that "Scholar Metrics are based on our index as it was on April 1st, 2012. For ease of comparison, they are NOT updated as the Scholar index is updated." Currency is often of major importance and concern for scientists, and Google Scholar's metrics are not necessarily kept up to date. While the currency of Google Scholar's journal rankings and citation counts is not the primary concern of most, exaggerated or inaccurate citation counts can be misleading.
With these concerns about how search results and journals are ranked, and the questions surrounding the authority of documents retrieved, Google Scholar has faced continued scrutiny from many information professionals since it was launched in 2004. Several tests have been conducted to compare Google Scholar's performance with proven reputable search databases. These comparisons have shown that, compared to subject-specific search databases, Google Scholar does not offer a clear advantage over PubMed, Web of Science (Mikki 2010; Adriaanse & Rensleigh 2011; García-Pérez 2010), BIOSIS Previews (Kirkwood & Kirkwood 2011), or Compendex (Meier & Conkling 2008). Google Scholar offers comparable precision, produces many of the same relevant results as these resources, and also retrieves many articles not included in most indexes (Walters 2011). When searching for science and medical information created after 1996, Google Scholar retrieves much of the same information as PubMed, BIOSIS, SciFinder, and other science databases (Levine-Clark & Kraus 2007). In a comparison between BIOSIS and Google Scholar, 56% of the 110 articles BIOSIS retrieved were also found in Google Scholar. Both of these searches were limited to material published after 1996 (Levine-Clark & Kraus 2007). Because there are many different ways to search across Google Scholar and other resources, and because the number of retrieved results varies significantly between them, data from these studies can be difficult to interpret. Table 1 illustrates this using findings on Google Scholar and PubMed from Walters' 2011 study, which compared Google Scholar to eight search databases.
|Precision Percentage||Number of Records Retrieved||Total Relevant Records Retrieved|
|Table 1: Google Scholar versus PubMed|
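The three quantities in the table header are related by the conventional definition of precision: the share of retrieved records that are relevant. The sketch below shows that arithmetic and why it can be hard to interpret when result-set sizes differ. It assumes this standard definition rather than any study-specific variant, and the tool names and numbers are invented for illustration; they are not taken from Walters (2011).

```python
def precision_percentage(relevant_retrieved: int, total_retrieved: int) -> float:
    """Percentage of retrieved records that are relevant."""
    if total_retrieved <= 0:
        raise ValueError("total_retrieved must be positive")
    if not 0 <= relevant_retrieved <= total_retrieved:
        raise ValueError("relevant_retrieved must be between 0 and total_retrieved")
    return 100.0 * relevant_retrieved / total_retrieved


# Invented figures: (relevant records retrieved, total records retrieved).
hypothetical_results = {
    "Tool A": (30, 40),
    "Tool B": (120, 400),
}

for tool, (relevant, total) in hypothetical_results.items():
    pct = precision_percentage(relevant, total)
    print(f"{tool}: {pct:.1f}% precision, {relevant} relevant of {total} retrieved")
```

In this invented case Tool A has the higher precision while Tool B returns four times as many relevant records, so neither figure alone settles which tool served the searcher better.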
Re-creations of some of the searches used in older comparisons have shown that Google Scholar has increased its coverage and improved its functionality over the years. Kirkwood & Kirkwood (2011) suggest that, in life sciences research, Google Scholar has better accuracy than BIOSIS Previews in searches that depend on geographic location.
Google Scholar also offers a potential added value in the number of documents it retrieves that are not included in bibliographic indexes. In 2010 Mikki searched for citations of 26 authors in Google Scholar and ISI Web of Science. Of the combined results from both resources, 69% were uniquely retrieved by Google Scholar. In a practical application searching for specific 19th century botany citations, GS found more accurate results than Web of Science's Cited Reference search. Google Scholar's uniquely retrieved items served to fill research gaps and demonstrated the value of Google Scholar for obscure research topics. While Google Scholar seems to excel in its "long tail of minor relevant items" (Mikki 2010), it ranks its results using the complex algorithms of a search engine. Unlike most literature indexes, GS offers no way to sort results. The limiters offered by Google Scholar allow for some control of retrieval, but with no way to sort items and no thesaurus, Google Scholar leaves something to be desired from an information professional's point of view (Cecchino 2010). It should not be forgotten, however, that few items retrieved uniquely by Google Scholar are scholarly documents, and fewer are likely to be of significance.
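Overlap and unique-retrieval percentages of the kind reported in these comparisons can be computed from sets of record identifiers. The following is a minimal sketch under the assumption that each result can be reduced to a comparable identifier; the function names and the sample identifiers are hypothetical, and the cited studies may have matched records differently.

```python
def overlap_percentage(reference: set, other: set) -> float:
    """Percentage of the reference tool's records also found by the other tool."""
    if not reference:
        raise ValueError("reference set must not be empty")
    return 100.0 * len(reference & other) / len(reference)


def unique_percentage(tool: set, other: set) -> float:
    """Percentage of the combined results retrieved only by `tool`."""
    combined = tool | other
    if not combined:
        raise ValueError("at least one set must be non-empty")
    return 100.0 * len(tool - other) / len(combined)


# Invented identifiers standing in for retrieved records.
index_results = {"r1", "r2", "r3", "r4"}
search_engine_results = {"r3", "r4", "r5", "r6", "r7", "r8"}

print(overlap_percentage(index_results, search_engine_results))
print(unique_percentage(search_engine_results, index_results))
```

The two measures use different denominators (one tool's results versus the combined results), which is one reason figures from different studies are not directly comparable.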
Because many researchers are familiar with standard Google searching, the interface, and natural language searching, Google Scholar may be preferable to the broad variety of interfaces, sophisticated search functions, and standardized vocabularies of well established databases/indexes. Because of this and the breadth of coverage found in Google Scholar, it may be a good starting point for researchers. Google Scholar's retrieval of nontraditional and open source publications also provides information that may not be found in PubMed, BIOSIS Previews, Web of Science, or other search databases, making GS a tool that could prove valuable throughout the research process. While irritating to many librarians, Google Scholar, by not allowing users to reorder search results, potentially allows researchers to notice older but relevant documents that are frequently obscured by default ordering of traditional database search results.
Skepticism of Google Scholar is merited. Google Scholar is lacking as a scholarly search tool because, first and foremost, it is not an abstracting and indexing service like the bibliographic databases frequently recommended by librarians. Those databases have literature indexed, often by humans, allowing it to be categorized with a controlled vocabulary and subject headings. Google Scholar is a search engine and as such it searches the full text, bibliographic information, and metadata of electronic documents. The computer programming that allows this to happen lacks the objective eye of a human indexer and, consequently, data can be interpreted incorrectly and questionable sources can pass through algorithms. Google Scholar's methods of document retrieval are contrary to librarians' understanding and expectation of information organization. Google Scholar's inability or unwillingness to elaborate on what documents its system crawls and the uncertain quality of Google Scholar's performance provide further reasons for information professionals and researchers to be wary of this tool, especially when so many quality databases exist and seem to sufficiently meet scientific information needs.
Over the eight years of Google Scholar's existence, many librarians have formed the professional opinion that if Google Scholar is used, it should not be the only resource used in academic research. This holds true for most resources, depending on the user's information needs. Likewise, it is well established in the information field that the user's needs will and should dictate which resources are employed. Google Scholar could reduce criticism among information professionals and improve the user experience by providing more transparency about its inclusion process, improving the ability to limit searches by subject, and allowing search results to be sorted and reordered, rather than only ranked. In the meantime, because of its popularity and the supplemental value it affords, librarians should include Google Scholar in library instruction to familiarize users with its functionality as well as its limitations. Users should be advised to be critical of information found by any means, and perhaps to be more vigilant with Google Scholar. Even with Google Scholar being included in instruction, library users and researchers will continue to ask what librarians think of Google Scholar. Often this inquiry will be met with a canned response cautioning against Google Scholar or indicating a clear preference for traditional databases. Rather than giving their usual response, perhaps science librarians in particular should respond to inquiries about their opinion of Google Scholar with "It depends on what you're doing."
The authors would like to acknowledge Brian Winterman for his helpful input and guidance throughout this research process.
About Google Scholar. [Updated 2012 Apr 5]. Google Scholar. [Internet]. Available from: http://scholar.google.com/intl/en/scholar/about.html
Adriaanse, L.S. and Rensleigh, C. 2011. Comparing Web of Science, Scopus and Google Scholar from an environmental sciences perspective. South African Journal of Library & Information Science [Internet]. [Cited 2012 Jun 20]; 77(2):169-178. Available from: http://www.sabinet.co.za/abstracts/liasa/liasa_v77_n2_a8.html
Bornmann, L., Marx, W., Schier, H., Rahm, E., Thor, A., and Daniel, H.D. 2009. Convergent validity of bibliometric Google Scholar data in the field of chemistry: citation counts for papers that were accepted by Angewandte Chemie International Edition or rejected but published elsewhere, using Google Scholar, Science Citation Index, Scopus, and Chemical Abstracts. Journal of Informetrics [Internet]. [Cited 2012 Jun 20]; 3(1):27-35. Available from: http://www.sciencedirect.com/science/article/pii/S1751157708000667
Cecchino, N.J. 2010. Google Scholar. Journal of the Medical Library Association [Internet]. [Cited 2012 Jun 20]; 98(4):320-321. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2947134/
Egghe, L. and Rousseau, R. 1990. Introduction to informetrics : quantitative methods in library, documentation, and information science. Amsterdam ; New York: Elsevier Science Publishers, p. 216-217.
García-Pérez, M.A. 2010. Accuracy and completeness of publication and citation records in the Web of Science, PsycINFO, and Google Scholar: a case study for the computation of h indices in Psychology. Journal of the American Society for Information Science & Technology [Internet]. [Cited 2012 Jun 20]; 61(10):2070-2085. Available from: http://onlinelibrary.wiley.com/doi/10.1002/asi.21372/full
Google Scholar Metrics. [Updated 2012 Apr 5]. Google Scholar. [Internet]. Available from: http://scholar.google.com/intl/en/scholar/metrics.html
Google Scholar Team. 2012 May 21. Google Scholar Sources. [Personal e-mail]. Accessed 2012 Jun 25.
Harzing, A.W. 2010. The publish or perish book : your guide to effective and responsible citation analysis. First ed. Melbourne: Tarma Software Research Proprietary Limited, p. 165-174.
Hauser, A. 2012 May 21. Google Scholar Sources. [Personal e-mail]. Accessed 2012 Jun 25.
Journal List. [Updated 2012 Jun 20]. Bethesda (MD): National Center for Biotechnology Information. PubMed Help. [Internet]. Available from: http://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.pubmedhelptable45/?report=objectonly
Kirkwood, H.P. and Kirkwood, M.C. 2011. Researching the life sciences: BIOSIS Previews and Google Scholar. Online [Internet]. [Cited 2012 Jun 20]; 35(3):24-28. Available from: http://pqasb.pqarchiver.com/infotoday/access/2341237991.html?FMT=ABS&FMTS=ABS:FT:PAGE&type=current&date=May%2FJun+2011&author=Hal+P+Kirkwood+Jr%3BMonica+C+Kirkwood&pub=Online&edition=&startpage=24&desc=Researching+the+Life+Sciences%3A+BIOSIS+Previews+and+Google+Scholar
Lascar, C. and Mendelsohn, L.D. 2011a. The evolution-intelligent design debate: a meaningful context for teaching the nature of science in information literacy. Part 1: historical background and philosophical considerations. Science & Technology Libraries [Internet]. [Cited 2012 Jun 20]; 30(4):354-371. Available from: http://www.tandfonline.com/doi/abs/10.1080/0194262X.2011.626338
Lascar, C. and Mendelsohn, L.D. 2011b. The evolution-intelligent design debate: a meaningful context for teaching the nature of science in information literacy. Part 2: social boundaries and cognitive models. Science & Technology Libraries [Internet]. [Cited 2012 Jun 20]; 30(4):372-386. Available from: http://www.tandfonline.com/doi/abs/10.1080/0194262X.2011.626339
Levine-Clark, M. and Kraus, J. 2007. Finding chemistry information using Google Scholar: a comparison with Chemical Abstracts Service. Science & Technology Libraries [Internet]. [Cited 2012 Jun 20]; 27(4):3-17. Available from: http://www.tandfonline.com/doi/abs/10.1300/J122v27n04_02
Mayr, P. and Walter, A.K. 2008. Studying journal coverage in Google Scholar. Journal of Library Administration [Internet]. [Cited 2012 Jun 20]; 47(1/2):81-99. Available from: http://www.tandfonline.com/doi/abs/10.1080/01930820802110894
Meier, J.J. and Conkling, T.W. 2008. Google Scholar's coverage of the engineering literature: an empirical study. The Journal of Academic Librarianship [Internet]. [Cited 2012 Jun 20]; 34(3):196-201. Available from: http://www.sciencedirect.com/science/article/pii/S0099133308000335
Mikki, S. 2010. Comparing Google Scholar and ISI Web of Science for earth sciences. Scientometrics [Internet]. [Cited 2012 Jun 20]; 82(2):321-331. Available from: http://www.springerlink.com/content/u86v2042qu654130/
Mullen, L.B. and Hartman, K.A. 2006. Google Scholar and the library web site: the early response by ARL libraries. College & Research Libraries [Internet]. [Cited 2012 Jun 20]; 67(2):106-122. Available from: http://crl.acrl.org/content/67/2/106.full.pdf+html
Scholar Help. [Updated 2012 Apr 5]. Google Scholar. [Internet]. Available from: http://scholar.google.com/intl/en/scholar/help.html
Shultz, M. 2007. Comparing test searches in PubMed and Google Scholar. Journal of the Medical Library Association [Internet]. [Cited 2012 Jun 20]; 95(4):442-445. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2000776/
Top Publications. [Updated 2012]. Google Scholar. [Internet]. Available from: http://scholar.google.com/citations?view_op=top_venues&hl=en
Walters, W.H. 2011. Comparative recall and precision of simple and expert searches in Google Scholar and eight other databases. portal: Libraries and the Academy [Internet]. [Cited 2012 Jun 20]; 11(4):971-1006. Available from: http://muse.jhu.edu/journals/portal_libraries_and_the_academy/v011/11.4.walters.html