Issues in Science and Technology Librarianship
Stephanie M. Ritchie
Agriculture & Natural Resources Librarian
University of Maryland College Park
College Park, Maryland
Lauren M. Young
University of Maryland College Park
College Park, Maryland
Agricultural researchers and science librarians must understand which research literature databases provide the most comprehensive coverage of agricultural subjects to support their inquiries. Once the domain of a few specialized databases, agricultural research literature is now covered by broad, multidisciplinary databases. The purpose of this study is to determine the most comprehensive database(s) for agricultural literature searching. We compared the coverage of eight bibliographic databases for a range of agricultural sub-topics to determine how much overlap exists and which database(s) best support discovery of agricultural research literature. We found that the multidisciplinary databases provided the most comprehensive coverage, along with one of the agriculture-specific databases. This study will help researchers and librarians determine where to invest their effort and resources when looking to find agricultural research content.
Agricultural sciences cover topics ranging from production and harvesting of agricultural products, to processing and food safety, transportation infrastructure, marketing and economics, food consumption and nutrition, public policy, climate and weather, and every aspect of the food systems that occurs within and impacts the environment. Agricultural sciences cover organisms from all kingdoms, including edible and ornamental plants, farmed and wild animals, beneficial and pest fungi, and bacteria and other microorganisms.
Agriculture is not only a science discipline, but also a social science and at times a humanities discipline, as food impacts every human life. Related disciplines of sociology, business and economics, politics, psychology, and other core social science topics study how people and communities relate to food and agriculture. Food and agriculture are the subject of, and occasionally the raw materials for, art. It can be very difficult to strictly demarcate what is and is not an "agricultural" discipline.
The core academic literature in the agricultural sciences has traditionally been indexed by a few subject specialized databases (Kawasaki 2002). However, due to agriculture's interdisciplinary nature, content from agriculture is also indexed by related disciplines, and vice versa. Some agricultural databases cover a specific agricultural topic such as food science or zoology. Others attempt to comprehensively cover agriculture as a whole and include a full range of topics. More recently, large multidisciplinary databases include agricultural sciences, and affiliated topics such as economics, in one large set of scholarly information (e.g., Google Scholar and Scopus).
With the rise of a handful of broad, multidisciplinary databases, an agricultural researcher may wonder whether it is worth the effort to search specialized agricultural indexes when these multidisciplinary tools may cover agriculture and related disciplines as well as or better than the specialized databases. Librarians and researchers must decide which databases meet their needs for access to a broad range of literature in agriculture. Agricultural researchers may be mainly interested in those databases that provide comprehensive coverage in their specialty, while librarians may have to balance the needs of many researchers to access content that covers several agricultural topics in depth and breadth.
This article compares eight commonly used bibliographic databases, including comprehensive agricultural indexes, more specialized databases, and broad multidisciplinary databases. The purpose of this research is to determine the most comprehensive database(s) for agricultural literature searching. The most recent study to assess the coverage of agricultural literature databases was conducted by Kawasaki (2004), and no studies on the topic have been published since broad-scope multidisciplinary databases emerged as popular research tools.
We selected eight databases to use for this study. The databases can be grouped into three types: comprehensive, multidisciplinary, and specialized. The comprehensive databases, AGRICOLA, AGRIS, and CAB Abstracts, cover agriculture broadly as a subject discipline. The multidisciplinary databases, Google Scholar (GS), Scopus, and Web of Science (WoS), include agricultural research literature within a larger set of sciences, social science and even humanities literature. The two remaining databases, BIOSIS Previews and Food Science and Technology Abstracts (FSTA), cover specific subject disciplines that overlap with or are partially included in agriculture. A summary of each database is provided in Table 1.
Table 1: Bibliographic Databases in Agricultural Sciences
|Type|Database|Producer|Description|URL|
|---|---|---|---|---|
|Comprehensive|AGRICOLA|National Agricultural Library|AGRICOLA combines the catalog of the National Agricultural Library collection and an index of thousands of journals on agricultural sciences from 1970 to the present. Some historical materials from the collection (pre-1970) are also included. Formats include books, journal articles, reports, white papers, conference proceedings, multimedia and other types of special materials.|https://agricola.nal.usda.gov|
|Comprehensive|AGRIS|United Nations Food and Agriculture Organization|AGRIS is a collaboratively compiled database of agriculture and technology literature for the agricultural sciences. Multilingual content and indexing is a unique feature of this database.|http://agris.fao.org|
|Comprehensive|CAB Abstracts|CABI|CAB Abstracts comprehensively covers applied life sciences including agriculture from 1973 onwards (1913- in archive) with over 8 million records. Unique content includes non-English journals, books and conference proceedings, from over 120 countries. CAB is produced by an international non-profit research organization.|https://www.cabdirect.org|
|Multidisciplinary|Google Scholar|Google|Google Scholar is a broad multidisciplinary collection of peer reviewed literature from across all disciplines and topics. Content is provided by Google Scholar partners, some of whom are publishers, and also compiled through a proprietary process developed by Google to crawl literature or partner content.|https://scholar.google.com|
|Multidisciplinary|Scopus|Elsevier|Scopus compiles peer-reviewed literature across disciplines in the sciences, technology, medicine, social sciences and humanities, and also emphasizes interdisciplinary subjects. Scopus also includes a set of analytical tools for citation, journal, author, and subject field impact.|https://www.scopus.com|
|Multidisciplinary|Web of Science|Clarivate Analytics|Web of Science selects the most significant journals, conference proceedings and books across a range of disciplines including the sciences, social sciences and humanities. This database includes historic citation indexing content that helps facilitate tracing ideas over time and measuring impact of scholarly work.|https://clarivate.com/products/web-of-science/|
|Specialized|BIOSIS Previews|Clarivate Analytics|BIOSIS Previews covers journals, meetings, books, and patents from the biological sciences. Content is indexed using a specialized vocabulary and MeSH terms for enhanced discovery through search.|http://wokinfo.com/products_tools/specialized/bp/|
|Specialized|FSTA|International Food Information Service|Food Science and Technology Abstracts broadly collects food and health related content from journals, trade publications, books, reviews, conference proceedings, reports, and selected patents and standards.|https://foodinfo.ifis.org/fsta|
We investigated four methodologies, summarized below, to determine the most appropriate approach for our study. We particularly sought a method that would incorporate the range of specialized topics that comprises agriculture as a broad, multidisciplinary topic.
After a comparative methodology review, we decided to use a modified version of the bibliography method (method 1), including a statistically significant number of randomly selected citations chosen from comprehensive bibliographies and review articles. The bibliography method was selected because the list of sources developed using this method includes items from a large number of relevant (as reflected by inclusion in a comprehensive/review article on a current agricultural topic) sources from multiple agricultural disciplines.
We chose the following three review articles to represent a variety of disciplines within agriculture to build our bibliographic citation set for analysis.
Baum, El-Tohamy, and Gruda (2015) review literature on vegetable crop production using arbuscular mycorrhizal fungi. The literature included in this review covers the agricultural sub-disciplines of plant science, horticulture, botany, biological control, mycology, nematology, and soil science, as well as broader topics like microbiology, environmental science, and general agricultural sciences. Jones et al. (2016) cover literature on sustainability metrics for diet and nutrition. The review includes sources in the sub-disciplines of nutrition and food science, and the broader topics of energy, environment, economics, development, and agricultural science. Stankus, Laincz, and Linck (2015) compile meat science literature. This comprehensive review covers meat science and animal sciences, and also touches on the related topics of agricultural economics and marketing, food science, and veterinary medicine.
Thirty citations were randomly selected from each of the three review articles (n=90). We searched for each of these items in each of the eight evaluation databases between November 2016 and January 2017. Items were most often searched using unique title strings in quotations (or as a phrase if database syntax did not allow quotations), and by a combination of title, keyword, journal source, or author field if they could not be located by unique title string. Items not found using these search strategies were considered not included in coverage for that database. Citations were considered covered by a database if a full bibliographic record was found, but not if only a fragment or partial citation record (common in Google Scholar) was found.
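The sampling step can be sketched in Python. This is only an illustration, not the authors' procedure: the article does not say how the random selection was made, and the reference-list sizes, citation labels, and fixed seed below are hypothetical placeholders.

```python
import random

SAMPLE_PER_REVIEW = 30


def draw_sample(reference_lists, per_review=SAMPLE_PER_REVIEW, seed=2016):
    """Draw the same number of citations, without replacement, from each list."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the draw is repeatable
    return {
        topic: rng.sample(citations, per_review)
        for topic, citations in reference_lists.items()
    }


# Hypothetical reference lists; real input would be the citations of each review.
reference_lists = {
    "agronomy": [f"agronomy-ref-{i}" for i in range(1, 121)],
    "sustainable diet": [f"diet-ref-{i}" for i in range(1, 151)],
    "meat science": [f"meat-ref-{i}" for i in range(1, 201)],
}

sample = draw_sample(reference_lists)
total = sum(len(chosen) for chosen in sample.values())
print(total)  # 90
```

Sampling without replacement ensures that no citation is counted twice within a topic.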
We selected statistical methodologies to both validate the data collected and provide a comparative value for the difference (or similarity) between each database. The chi square test is a standard statistical method used to compare the data observed to the data expected and reveal whether a significant difference exists among the variables (i.e., database content coverage for this study). To explore the correlation between the variables, we selected an overlap analysis method to yield a numerical representation for the degree of difference/similarity between each database.
We used a chi square analysis to evaluate if the expected number of citations from our sample were found in each database, or if the frequency was significantly different among the eight databases.
To evaluate our observed data compared to the expected data, we performed four separate chi square analyses using SPSS v. 23 (IBM Corporation 2016) for the data as a whole set (n=90) and for each citation set separately (i.e., meat science, sustainable diet, and agronomy). For the chi square analysis, a p<0.05 indicates a significant difference in coverage among the databases that is not due to chance.
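For readers without SPSS, the calculation can be approximated with a short Python sketch. It assumes the test was set up as a 2 x 8 table (found versus not found, by database), which gives the seven degrees of freedom used in this study; the article does not state the exact SPSS configuration, and the counts below are hypothetical, not the study data.

```python
CRITICAL_VALUE_DF7_ALPHA_05 = 14.067  # chi square critical value for df = 7, p = 0.05


def chi_square_found_by_database(found_counts, sample_size):
    """Pearson chi square statistic for a 2 x k found/not-found table."""
    k = len(found_counts)
    total_found = sum(found_counts)
    if total_found == 0 or total_found == sample_size * k:
        raise ValueError("expected counts would be zero; the test is undefined")
    # Every database was searched for the same citations, so the expected
    # counts are identical for each database column.
    expected_found = total_found / k
    expected_missing = sample_size - expected_found
    statistic = 0.0
    for found in found_counts:
        missing = sample_size - found
        statistic += (found - expected_found) ** 2 / expected_found
        statistic += (missing - expected_missing) ** 2 / expected_missing
    degrees_of_freedom = (2 - 1) * (k - 1)
    return statistic, degrees_of_freedom


# Hypothetical counts of items found (out of 30) in eight databases.
hypothetical_counts = [7, 10, 20, 27, 25, 25, 12, 9]
statistic, df = chi_square_found_by_database(hypothetical_counts, sample_size=30)
significant = statistic > CRITICAL_VALUE_DF7_ALPHA_05
print(df, significant)
```

Comparing the statistic with the tabulated critical value avoids the need for a statistics library; a package such as SciPy would report an exact p-value instead.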
We performed an overlap analysis to determine the similarity between each database and set of databases. We reviewed several types of similarity metrics for binary data (Choi et al. 2010) and chose the simple matching coefficient (Sokal and Michener 1958), where overlap is calculated as the number of matching attributes/number of attributes; in other words, simple matching is the number of citations that were both present or both absent in the compared databases divided by all the citations in the sample set. For example, if 8 citations were found in both databases, 12 were not found in either database, and 10 were found in one but not the other database, the simple matching coefficient would be (8+12)/(30) = .67.
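The worked example can be reproduced with a few lines of Python. The two presence vectors are constructed only to match the counts in the example (8 found in both, 12 found in neither, 10 found in exactly one database); they are not study data, and the function name is illustrative.

```python
def simple_matching(presence_a, presence_b):
    """Share of citations with the same status (found or not found) in both databases."""
    if len(presence_a) != len(presence_b):
        raise ValueError("both databases must be checked against the same citations")
    matches = sum(1 for a, b in zip(presence_a, presence_b) if a == b)
    return matches / len(presence_a)


# 1 = citation found, 0 = citation not found.
database_a = [1] * 8 + [0] * 12 + [1] * 6 + [0] * 4
database_b = [1] * 8 + [0] * 12 + [0] * 6 + [1] * 4

coefficient = simple_matching(database_a, database_b)
print(round(coefficient, 2))  # 0.67
```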
The simple matching coefficient works well to determine similarity where absence of a variable is as meaningful as its presence. For example, if a researcher is unable to use a particular database to find a specific citation, that database confers the same experience as another database missing that same citation. Given we are interested in how much utility a database provides to the researcher, we are interested in similarity between databases.
One danger in selecting the simple matching coefficient is that if the comparison sets contain a large number of shared items that do not meet the experimental criteria (i.e., both not found), it can create an artificially high index of similarity compared to the Jaccard coefficient (Jaccard 1912) or other similarity metrics. The Jaccard coefficient leaves shared absences out of the calculation entirely; using the example above, it would be 8/(8+10) = 0.44. However, the methodology of this study limits the number of citations and thus keeps the set of instances where a specific item would not be found in either database relatively small, so instances of shared absence should not disproportionately impact the similarity index.
For each area of agricultural research, 28 simple matching comparisons were made so that each database was compared to every other database. A greater simple matching coefficient indicates greater similarity between the two databases, while a lesser simple matching value indicates less similarity between the two databases.
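The count of 28 follows from pairing each of the eight databases with every other one (8 x 7 / 2). A minimal sketch of the pairwise calculation is shown here; the presence vectors cover only five made-up citations and are purely hypothetical.

```python
from itertools import combinations

# Hypothetical presence data (1 = found, 0 = not found) for five citations.
presence = {
    "AGRICOLA":        [1, 0, 0, 1, 0],
    "AGRIS":           [1, 0, 0, 1, 0],
    "CAB Abstracts":   [1, 1, 1, 1, 0],
    "Google Scholar":  [1, 1, 1, 1, 0],
    "Scopus":          [1, 1, 1, 0, 0],
    "Web of Science":  [1, 1, 1, 0, 0],
    "BIOSIS Previews": [0, 1, 1, 0, 0],
    "FSTA":            [0, 0, 0, 0, 0],
}


def simple_matching(presence_a, presence_b):
    """Share of citations with the same found/not-found status in both databases."""
    matches = sum(1 for a, b in zip(presence_a, presence_b) if a == b)
    return matches / len(presence_a)


# One coefficient for every unordered pair of databases.
coefficients = {
    (first, second): simple_matching(presence[first], presence[second])
    for first, second in combinations(presence, 2)
}

print(len(coefficients))  # 28
```

Because the coefficient is symmetric, each pair needs to be computed only once, which is why the result fills half of a matrix like those in the coefficient tables.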
We selected two different data visualization tools to display the results so that the comparison between the databases might be highlighted in ways that numerical tables do not provide. Additionally, we wanted to explore emerging tools that allow visualizations beyond familiar bar charts and scatter plots. All of the data visualization tools selected have a low barrier to use, i.e., they do not require coding or onerous training, although they do require installation of special programs (in some cases) and some investment of time to learn.
Grid heat map visualizations (Figure 1) provide a visual representation of citation presence in each of the eight databases. We listed each database alphabetically and grouped by type to represent the x axis and numbered each set of 30 citations sorted alphabetically by author to create the y axis. Each colored grid square represents a citation found in the database listed at the top of that column. For columns comparing databases by type, we selected darker colors to represent greater overlap, or more density, of coverage of content across databases by topic and type. These figures were created manually in Microsoft Excel using 0s and 1s to represent absence and presence of citations in each cell. The SUM function was used to aggregate presence results by database type. To create the final grid heat maps, we applied conditional formatting using color scales to each grid cell.
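The same 0/1 grid and the per-type sums can be expressed in Python for anyone who prefers a script to a spreadsheet. This sketch mirrors the spreadsheet logic described above rather than reproducing the authors' workbook; the two citation rows are hypothetical.

```python
DATABASE_TYPES = {
    "Comprehensive": ["AGRICOLA", "AGRIS", "CAB Abstracts"],
    "Multidisciplinary": ["Google Scholar", "Scopus", "Web of Science"],
    "Specialized": ["BIOSIS Previews", "FSTA"],
}


def summarize_row(presence):
    """Return how many databases of each type contain one citation."""
    return {
        db_type: sum(presence[name] for name in names)
        for db_type, names in DATABASE_TYPES.items()
    }


# Hypothetical 0/1 presence values for two citations.
citations = {
    "Citation 1": {
        "AGRICOLA": 1, "AGRIS": 1, "CAB Abstracts": 1,
        "Google Scholar": 1, "Scopus": 1, "Web of Science": 1,
        "BIOSIS Previews": 0, "FSTA": 0,
    },
    "Citation 2": {
        "AGRICOLA": 0, "AGRIS": 0, "CAB Abstracts": 1,
        "Google Scholar": 1, "Scopus": 0, "Web of Science": 1,
        "BIOSIS Previews": 1, "FSTA": 0,
    },
}

summary = {label: summarize_row(row) for label, row in citations.items()}
for label, sums in summary.items():
    print(label, sums)
```

The per-type sums range from 0 to 3 (or 0 to 2 for the specialized group), which is the value a color scale would map to darker or lighter cells.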
We determined that a visualization should be produced to represent the distribution of article discovery across the eight databases. This type of distribution displays the number of articles that appear in one to eight database(s) sequentially and visually represents the nature of the sample set, i.e., whether the sample set has a normal distribution and whether the random sample of citations is equally distributed.
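The underlying distribution is a simple tally of how many databases contain each citation. A minimal Python sketch follows; the six presence rows are hypothetical and stand in for the 90 sampled citations.

```python
from collections import Counter

DATABASE_COUNT = 8


def coverage_distribution(presence_rows, database_count=DATABASE_COUNT):
    """Count citations by the number of databases (1 to database_count) containing them."""
    tallies = Counter(sum(row) for row in presence_rows)
    # Citations found in no database are left out, matching the one-to-eight range.
    return {n: tallies.get(n, 0) for n in range(1, database_count + 1)}


# Hypothetical rows: one list of eight 0/1 values per citation.
presence_rows = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

distribution = coverage_distribution(presence_rows)
print(distribution)
```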
To create this visualization, Gephi (Bastian et al. 2009) was chosen because of its built-in functionality for creating radial distribution graphics (Figure 2). The same coded dataset, created for use with Cytoscape, was loaded with articles identified as source nodes, databases as target nodes, and edges (lines) representing the relationships between source and target nodes. Unlike Cytoscape, style manipulation is somewhat difficult in Gephi, so labels were added in Photoshop after the visualization was created.
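A dataset of this shape can be written as a plain edge list. The sketch below is an assumption about the file layout, not the authors' actual dataset: it uses the `Source` and `Target` column headers commonly used when importing edge tables into Gephi, and the citation labels are hypothetical.

```python
import csv
import io


def build_edge_list(found_in):
    """Return CSV text with one citation-to-database edge per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])  # assumed header names for edge import
    for citation, databases in found_in.items():
        for database in databases:
            writer.writerow([citation, database])
    return buffer.getvalue()


# Hypothetical results: each citation maps to the databases where it was found.
found_in = {
    "citation-01": ["CAB Abstracts", "Google Scholar", "Scopus"],
    "citation-02": ["Google Scholar"],
    "citation-03": [],  # found nowhere, so it contributes no edges
}

edge_csv = build_edge_list(found_in)
print(edge_csv)
```

Citations with no edges simply do not appear in the file, so they would not be drawn in the resulting graph.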
A set of 90 items in total (30 items from each of three review article/bibliography reference lists) was searched for in each database. The total number of items found in each database by topic and in total is summarized in Table 2.
Table 2: Count of items found by topic and database from a set of 90 items
The chi square analyses showed that the number of citations found differed significantly among the databases for sustainable diet (χ² = 43.93, df = 7, p<0.001), agronomy (χ² = 60.65, df = 7, p<0.001), meat science (χ² = 27.5, df = 7, p<0.001), and overall (χ² = 96.09, df = 7, p<0.001). These results indicate that the differences in the distribution of citations across databases are unlikely to be due to chance.
Overlap among the eight databases was calculated using the simple matching coefficient, where overlap is defined as the number of articles that were either found in both databases or not found in either, divided by the total number of articles. For the sustainable diet review article, the greatest overlaps were between Scopus and Web of Science, Web of Science and Google Scholar, and Google Scholar and Scopus. This result shows a great degree of correlation between the multidisciplinary databases. The smallest overlaps were between AGRICOLA and Google Scholar, and AGRICOLA and Scopus. AGRICOLA contained only seven of the 30 citations.
Table 3: Simple Matching Coefficients for a set of citations from the sustainable diet review article
For the agronomy review article, the greatest overlaps were between Google Scholar and Web of Science, CAB and Google Scholar, and CAB and Web of Science. CAB matched the multidisciplinary databases much more closely than it matched the other comprehensive databases. The smallest overlaps were between CAB and FSTA, and FSTA and Google Scholar, a reflection of agronomy being outside the subject scope of FSTA content.
Table 4: Simple Matching Coefficients for a set of citations from the agronomy review article
For the meat science review article, the greatest overlaps were between Scopus and Web of Science, and AGRICOLA and AGRIS. The amount of content retrieved by each of these pairs was similar. The smallest overlaps were between AGRICOLA and CAB, AGRICOLA and Google Scholar, and AGRICOLA and Web of Science. The data show strong correlation among the comprehensive databases and among the multidisciplinary databases, with the caveat that CAB resembles the multidisciplinary databases rather than the comprehensive databases.
The grid heat maps are arranged to quickly show the unique fingerprint of each database with regard to the sample citation set. Each column represents a database and each row represents a citation from the sample. Columns more densely filled indicate databases or database types with greater coverage of the sample content. The grid heat map figure displays the count of items found and summarized in Table 2 in a manner that allows for quick visual comparison of each sample citation's presence across databases. The three columns showing the items found by database type for each subject topic help identify individual citations (or possibly types of content) more likely to be found in different types of databases.
Figure 1: A visual representation of citations found by database and topic. A) Sustainable diets. B) Agronomy. C) Meat science.
The visualization in Figure 2 created with Gephi (Bastian et al. 2009) illustrates the relationships between the 90 sample citations and the eight comparison databases. Each bubble represents a database and is sized proportionally to the overall number of citations found within. Each small red dot represents a citation, and is sized and grouped according to how many times (indicated by number label) it was found in a database. Each line represents a link between a citation and a database in which it was found. Citations not found in any databases are not shown.
The visualization shows that there is a skew towards citation representation in five or six databases with fewer citations at either end of the numerical distribution (although with a slight dip in citations found in four databases). In addition to the distribution, this visualization also shows the rank of the databases in terms of the overall number of citations found within as indicated with bubble size and order.
Figure 2: A Gephi (Bastian et al. 2009) Distribution Graph visually illustrates the relationships between the 90 sample citation articles and the eight comparison databases.
The sustainable diets review article comprised literature from a wide variety of disciplines related to both diet and sustainability. Diet topics covered nutrition, medicine and food science disciplines. Sustainability literature included energy, environment, economics and international development. Although some researchers have found uneven coverage of sustainability literature in databases (Brunn 2014), Google Scholar and the other multidisciplinary databases covered literature cited by this interdisciplinary review article particularly well. Only three items were not found in Google Scholar: one was an international government report (found in none of the databases) and one was a National Research Council report found in AGRICOLA and AGRIS, but not in other sources. The last item not found in Google Scholar was a conference proceedings book published by the United Nations Food and Agriculture Organization (FAO). However, an individual chapter from within this proceeding was found with a quoted title search. Governmental organization documents may be more likely to be found in databases produced by governmental organizations.
The agronomy review article mainly consists of horticulture, plant science, and crop science articles. Sub-disciplines such as botany, mycology, nematology, and biological control were represented, as well as related disciplines of microbiology and biotechnology. A few agriculture and environmental sciences articles were covered. Google Scholar and CAB both found most of the same items. Items found by one database but not the other tended to be older items (published pre-1990) and/or published by less well known European journals/organizations. Interestingly, one item found by Google Scholar, but not CAB, was one of the two citations found by FSTA on the topic of nutrition. Web of Science and Scopus were very similar. Scopus lacks some of the older materials covered in Web of Science, and Web of Science lacks some issues of journals otherwise covered by Scopus. BIOSIS coverage was also very similar to Scopus and Web of Science, although a few newer articles in horticulture were excluded. As expected, FSTA found very little agronomy content. AGRICOLA and AGRIS were fairly well matched in their coverage of agronomy literature. Most notably, CAB behaved much more like the multidisciplinary databases and less like the other comprehensive databases.
The meat science review article included mainly literature from the specific meat science and animal science disciplines. A few general food sciences items, as well as a small amount of literature across veterinary science and medicine, aquaculture, economics and health disciplines were also covered. When results from CAB and Google Scholar are combined, they cover all but one of the citations in the set, a marketing trade magazine article, which was not covered by any of the databases. The citations missed by CAB and Google Scholar were from trade magazines, books, or were older or of international origin. Scopus and Web of Science both included similar content and excluded similar content. Excluded content consisted of trade magazine articles, a book and book chapter, a labor studies journal article, and a couple of older meat sciences journal articles. This is likely because content from trade magazines and books, as well as some older content, is outside the scope of Scopus and Web of Science. FSTA found a similar number of items as Scopus and Web of Science, but excluded some of the animal sciences content and included some of the meat and food science content, as would be expected due to its subject scope. BIOSIS also excluded content outside its subject-specific scope, largely meat and food science articles. Where AGRICOLA varied from AGRIS, it contained book content not included in the multidisciplinary databases and CAB.
The comprehensive databases as a whole were not comprehensive in their coverage of agriculture, although CAB was much more comprehensive than AGRICOLA and AGRIS. CAB was fairly complete in its coverage of agronomy and meat sciences, covering the content better than some of the multidisciplinary databases. In the sustainable diets topic area, CAB lagged behind the multidisciplinary databases, but did provide much more coverage than the specialized or other comprehensive databases. Due to the interdisciplinary nature of sustainability, some of the subtopics represented in the citation set may be outside the scope of CAB, or may be areas where CAB could expand coverage in the future. The superior coverage by CAB in comparison to AGRICOLA and AGRIS is consistent with findings from past studies (Kawasaki 2004). Neither AGRICOLA nor AGRIS covered agriculture particularly comprehensively. Strong correlation between AGRICOLA and AGRIS may be related to the sharing of content between the databases (Hood and Ebermann 1990), and to the scope of materials included for indexing, as both are created by government entities. The data did provide some indication that content types other than peer-reviewed journal articles were covered better in these databases than in some others, but not enough data with content of varied types were included in each set to allow a statistically valid analysis. Future research on this topic may be of interest, especially as comprehensive journal article content is adequately provided by other sources.
The specialized databases each cover in-scope topics very well and out-of-scope topics poorly. BIOSIS covered agronomy literature fairly well, but not the sustainable diet and meat science topics that are mostly out of scope of the biological sciences. FSTA covered meat science literature fairly well. Neither database covered the interdisciplinary sustainable diets topic well, but both did provide some coverage. This result is as expected, but might also identify subtopic areas in which to expand coverage.
The multidisciplinary databases cover most of the agricultural science literature adequately across subjects. Google Scholar covered almost all of the agricultural and interdisciplinary literature. Notable weak areas of coverage were informal and trade published items, older journal articles and books. These types of items may be excluded from Google Scholar by design and are likely to be found with the Google search engine or in Google Books. Scopus and Web of Science also found a large majority of the citations. Excluded content was similar to Google Scholar in that it was not published through scholarly sources, it was older, or it was book content.
Overall, the data show that multidisciplinary databases cover agricultural subject content very well and interdisciplinary topics especially well. Researchers can be fairly confident they will find the majority of scholarly content using these databases. Notably, CAB is the only agriculture-focused database that covers the subject comprehensively. Other agriculture-specific databases such as AGRICOLA or AGRIS do not provide comprehensive coverage of agricultural literature, but do offer content that is outside the scope of large scholarly databases and of interest to agricultural researchers. Specialized topical databases can complement other database searches, but should only be considered if a research topic is well within the scope of the database. Most agricultural researchers should be able to find adequate content in multidisciplinary databases and may only need to rely on supplemental searches for research projects that require completeness, such as review articles and bibliographies.
As libraries and academic institutions weigh decisions about which research tools including databases are necessary to support institutional research and scholarship, this study provides information about where researchers can find adequate agricultural subject content. This study focused on finding known items within the selected databases. Follow-up studies to determine how well the search features of these databases function to help researchers locate agricultural information by subject will complement the results of this study in determining which research tools best support discovery of agricultural literature.
Bastian, M., Heymann, S. & Jacomy, M. 2009. Gephi: an open source software for exploring and manipulating networks. In: Third International AAAI Conference on Weblogs and Social Media. 2009 May 17-20; San Jose, CA. p. 361-362. [accessed 2017 Oct 16]. https://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154
Baum, C., El-Tohamy, W. & Gruda, N. 2015. Increasing the productivity and product quality of vegetable crops using arbuscular mycorrhizal fungi: A review. Scientia Horticulturae 187:131-141. DOI: 10.1016/j.scienta.2015.03.002
Baykoucheva, S. 2010. Selecting a database for drug literature retrieval: A comparison of MEDLINE, Scopus, and Web of Science. Science and Technology Libraries 29(4):276-288. DOI: 10.1080/0194262X.2010.522946
Brown, B.N. 2007. A comparative analysis of ecology literature databases. In: Special Libraries Association: Issues and Innovations in Biomedical and Life Sciences Librarianship Contributed Papers. Denver, CO. [accessed 2017 Oct 16]. http://dbiosla.org/events/sla_conference/papers/brownpaper.pdf
Choi, S-S., Cha, S-H. & Tappert, C.C. 2010. A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics 8(1):43-48. [accessed 2017 Jul 17]. http://www.iiisci.org/journal/sci/FullText.asp?var=&id=GS315JG
Griffith, B.C., White, H.D., Drott, M.C. & Saye, J.D. 1986. Tests of methods for evaluating bibliographic databases: An analysis of the National Library of Medicine's handling of literatures in the medical behavioral sciences. Journal of the American Society for Information Science 37(4):261-270. DOI: 10.1002/(SICI)1097-4571(198607)37:4<261::AID-ASI12>3.0.CO;2-6
Grindlay, D.J.C., Brennan, M.L. & Dean, R.S. 2012. Searching the veterinary literature: A comparison of the coverage of veterinary journals by nine bibliographic databases. Journal of Veterinary Medical Education 39(4):404-412. DOI: 10.3138/jvme.1111.109R
Hood, M.W. & Ebermann, C. 1990. Reconciling the CAB thesaurus and AGROVOC. Quarterly Bulletin of the International Association of Agricultural Information Specialists 35(4):181-185.
IBM Corporation. 2016. IBM SPSS Statistics for Windows, Version 23.0.
Jones, A.D., Hoey, L., Blesh, J., Miller, L., Green, A. & Shapiro, L.F. 2016. A systematic review of the measurement of sustainable diets. Advances in Nutrition: An International Review Journal 7(4):641-664. DOI: 10.3945/an.115.011015
Kawasaki, J.L. 2002. Indexing of core agriculture serials. Quarterly Bulletin of the International Association of Agricultural Information Specialists 47(2):33-37.
Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B. & Ideker, T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research 13(11):2498-2504. DOI: 10.1101/gr.1239303
Sokal, R.R. & Michener, C.D. 1958. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin 38:1409-1438. [accessed 2017 Jul 17]. https://archive.org/details/cbarchive_133648_astatisticalmethodforevaluatin1902
Stankus, T., Laincz, J. & Linck, R. 2015. Reviews of science for science librarians: Meat science around the world, 1980-2014. Science & Technology Libraries 34(3):167-227. DOI: 10.1080/0194262X.2015.1072491
This work is licensed under a Creative Commons Attribution 4.0 International License.