Issues in Science and Technology Librarianship
Summer 2001

URLs in this document have been updated. Links enclosed in {curly brackets} have been changed. If a replacement link was located, the new URL was added and the link is active; if a new site could not be identified, the broken link was removed.

Database Reviews and Reports

The Identification of Authors in the Mathematical Reviews Database

Bert TePaske-King
Manager, Bibliographic Services Department
Mathematical Reviews

Norman Richert
Administrative Editor
Mathematical Reviews
nrichert@ams.org

The name "John Smith" is actually relatively rare in the MathSciNet Database, the publication on the web of the Mathematical Reviews Database. As of the writing of this piece, there were a mere nine distinct authors of mathematical papers and books with the name "John Smith". And all have different middle initials, so they are easy to keep straight. Contrast this with the 57 authors "Wang, Li", some with additional names, some without, and the 32 authors "Wang, Wei" with no additional names. With over 60 years of mathematical literature and almost 350,000 authors, it becomes clear that maintaining author information with the goal of distinguishing all these authors is no small task.

[Matches for: wang, wei]

Every research community understands that an author is more than just a character string in a database. In the days when Mathematical Reviews was founded it was perhaps possible to actually know all the people working in a particular subspecialty of mathematics. Those days are long gone. Either by amazing, serendipitous luck or by wise foresight (probably a combination), from its founding in 1940, Mathematical Reviews has made an attempt to identify authors of items listed in its publications. This was done in the beginning entirely by hand, with data for each individual (published name variants, MR numbers, etc.) kept on 3x5 cards and filed alphabetically. There was additional author correspondence, using paper mail in the beginning, documenting research on author identification. The author indexes that were periodically created represented a massive piece of work, collating the information in the bibliographic listings and on the cards.

[Image of staff member filing cards]

The zeal with which author identification was pursued over the years is impressive. There has long been a concern for the "proper" form of a name. Consider all the variations of name that might be present in the published works of a single individual. These variations involve, for example, author taste, editorial standards, typesetting methods, name changes, and publishing in more than one language. Consider all the issues of name breaking, accents, spaces, hyphens, and alphabets. In Western style names -- "First Middle Last" -- it is easy to imagine three variations -- no middle, middle initial, full middle -- and then three more when combined with a possibly shortened first name. This is just the start. There are, for example, Ukrainian forms for names, Latvian/Lithuanian forms, the various transliteration schemes associated with non-Roman alphabets, and the various transliteration schemes associated with character-based writing.
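The count of six Western-style forms can be checked with a short enumeration. The following Python sketch is purely illustrative and is not part of any Mathematical Reviews tooling; the function name `western_name_variants` and the choice of an initial as the shortened first name are assumptions made for the example.

```python
from itertools import product

def western_name_variants(first, middle, last, short_first):
    """List inverted ("Last, First Middle") forms of one Western-style name."""
    firsts = [first, short_first]            # full or shortened first name
    middles = ["", middle[0] + ".", middle]  # no middle, middle initial, full middle
    # strip() removes the trailing space left when the middle name is omitted
    return [f"{last}, {f} {m}".strip() for f, m in product(firsts, middles)]

variants = western_name_variants("Kenneth", "Allen", "Ross", short_first="K.")
for name in variants:
    print(name)
```

Two choices for the first name times three for the middle name give the six forms. Transliteration, accents, and name breaking multiply the count further.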

The current champion in the "Total Number of Name Variations Contest" is "Krivoshe\u\i n, Leonid Evgen\cprime evich" (using the TeX codes). In the Mathematical Reviews Database there are 20 distinct name variations for this author. However, all the items for this author can be easily found, with no need to perform multiple searches using the various name possibilities.

This example involves only one author. The real challenge is the intersecting lists of name variations between distinct authors. One example is "Johnson, Norman". In the Search Author Database tool of MathSciNet, two authors are returned, with exactly the same list of four published name variants. The sleuthing at Mathematical Reviews proved these to be two distinct individuals.

[Matches for: johnson, norman]

There was a time when Mathematical Reviews even attempted to "correct" the published form of a name, perhaps believing that some editors and publishers just didn't try hard enough. As a survivor of those days, an internal Mathematical Reviews concept is that of the "preferred name," which, in the past, meant the form of the name preferred by Mathematical Reviews. This preferred name may have a numerical superscript, in cases where there is confirming correspondence distinguishing an individual with a given preferred name string from all other individuals with that preferred name string. For example, 14 of the 32 "Wang, Wei" authors are distinguished by a superscript. The 18 others are still awaiting that correspondence.

Beginning in 1985, with the advent of the electronic author database, the process of author identification has been automated to an astonishing degree. The results of this effort, performed by the staff of MR, are available in the "Search Author Database" tool in MathSciNet, visible in the bracketed full names in the PDF form of a full item in MathSciNet, published in printed Mathematical Reviews headings, and printed in the paper author indexes. There is far more information available to MR staff to aid in the ongoing task of identifying authors. Even when the primary publication vehicle for the MR Database was paper Mathematical Reviews, this automation was a great advance.

For each author a separate "author-individual" record is maintained in the MR Database, containing each published name variant associated with an author, institutional affiliations, mathematical subject classifications assigned to the author-individual's items, coauthors, and references to items indexed in the MR Database. Each record is headed by the preferred name, which usually represents the fullest published form of an author's name that will distinguish the author-individual from others who publish using similar names; on occasion, an unpublished full name is used as a preferred name if necessary to identify the author-individual uniquely.
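As a rough picture of such a record, the sketch below models an author-individual as a Python dataclass. It is a minimal illustration under stated assumptions, not the actual MR Database schema: the class name, every field name, the `add_item` helper, and the sample author "Doe, Jane Quinn" with her affiliation and classification code are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class AuthorIndividual:
    """One record per person; all field names are hypothetical."""
    preferred_name: str                    # heads the record
    superscript: Optional[int] = None      # set once correspondence confirms identity
    name_variants: Set[str] = field(default_factory=set)
    affiliations: Set[str] = field(default_factory=set)
    subject_classes: Set[str] = field(default_factory=set)
    coauthors: Set[str] = field(default_factory=set)
    item_ids: List[str] = field(default_factory=list)

    def add_item(self, item_id, published_name, affiliation=None,
                 subject_class=None, coauthors=()):
        """Attach an indexed item and fold its metadata into the record."""
        self.item_ids.append(item_id)
        self.name_variants.add(published_name)
        if affiliation:
            self.affiliations.add(affiliation)
        if subject_class:
            self.subject_classes.add(subject_class)
        self.coauthors.update(coauthors)

# Invented sample data, for illustration only
doe = AuthorIndividual(preferred_name="Doe, Jane Quinn")
doe.add_item("item-1", "Doe, J. Q.", affiliation="Example University", subject_class="05")
doe.add_item("item-2", "Doe, Jane Q.", subject_class="05", coauthors=["Roe, Richard"])
print(sorted(doe.name_variants))
```

Keeping the variants, affiliations, and classifications as sets means each new item enriches the record without creating duplicates.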

The process of author identification begins after the bibliographic data of an item is keyed into the (internal) MR Database. That bibliographic data includes a name string, which reflects manual breaking and other markup. The first pass in the identification process is carried out automatically by a number of programs that compare the name string that appears on an item, the institutional affiliation listed for the author, and the classification assigned to the item by the MR editors against author-individuals already in the MR Database. These programs look for the best possible (ideally exact) match on all three elements, and they succeed roughly eighty percent of the time.
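The first pass can be pictured with the following Python sketch. The MR programs themselves are not described in detail here, so this is only an assumed simplification: records are plain dictionaries, the keys `variants`, `affiliations`, and `classes` are hypothetical, and a match is accepted only when exactly one record agrees on all three elements.

```python
def find_exact_match(name, affiliation, subject_class, records):
    """Return the single record agreeing on all three elements, else None."""
    hits = [
        r for r in records
        if name in r["variants"]
        and affiliation in r["affiliations"]
        and subject_class in r["classes"]
    ]
    # Zero hits or several hits both fall through to further checking
    return hits[0] if len(hits) == 1 else None

# Invented sample records, for illustration only
RECORDS = [
    {"id": 1, "variants": {"Wang, Wei"}, "affiliations": {"University A"}, "classes": {"35"}},
    {"id": 2, "variants": {"Wang, Wei"}, "affiliations": {"University B"}, "classes": {"11"}},
]

hit = find_exact_match("Wang, Wei", "University B", "11", RECORDS)
miss = find_exact_match("Wang, Wei", "University C", "11", RECORDS)
print(hit["id"] if hit else None, miss)
```

The name string alone cannot separate the two records, but the affiliation and classification together can.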

For the remaining twenty percent, the program makes its "best guess" on a potential match to an author-individual in the database, using preset algorithms to rank possible matches. It is on this remaining twenty percent that most of the MR staff time allotted to author identification is spent: the keyboarded name string is checked for typos or a mistaken name break; the intent of the journal in name presentation is carefully checked (journals do make errors in the presentation of first name/family name); alternate spellings are examined; bibliographies are checked for self-citations; coauthors are checked for a possible match. When all possibilities available via the item at hand are exhausted, staff use internet and web-based tools to search for authors, e.g., searching for full names at university/department web sites or for curricula vitae with lists of publications. Authors are contacted by e-mail or paper mail if necessary.
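A "best guess" ranking of this kind might look like the sketch below. The real ranking algorithms are not specified in this article, so everything in the example is an assumption: the scoring formula, the weights, the dictionary keys, and the use of `difflib.SequenceMatcher` from the Python standard library as a stand-in for name similarity.

```python
from difflib import SequenceMatcher

def score(item, record):
    """Hypothetical score: name similarity plus bonuses for supporting evidence."""
    name_sim = max(
        (SequenceMatcher(None, item["name"].lower(), v.lower()).ratio()
         for v in record["variants"]),
        default=0.0,
    )
    total = 2.0 * name_sim                       # assumed weight for the name
    if item["affiliation"] in record["affiliations"]:
        total += 1.0                             # assumed bonus: same institution
    if item["subject_class"] in record["classes"]:
        total += 1.0                             # assumed bonus: same subject area
    total += 0.5 * len(set(item["coauthors"]) & record["coauthors"])
    return total

def rank_candidates(item, records):
    """Order candidate author-individuals from most to least plausible."""
    return sorted(records, key=lambda r: score(item, r), reverse=True)

# Invented sample data, for illustration only
CANDIDATES = [
    {"id": 1, "variants": {"Wang, Wei"}, "affiliations": {"University B"},
     "classes": {"11"}, "coauthors": set()},
    {"id": 2, "variants": {"Wang, Wei"}, "affiliations": {"University A"},
     "classes": {"35"}, "coauthors": {"Li, Ming"}},
]
item = {"name": "Wang, Wei", "affiliation": "University A",
        "subject_class": "35", "coauthors": ["Li, Ming"]}

ranked = rank_candidates(item, CANDIDATES)
print([r["id"] for r in ranked])
```

A ranked list of this sort only orders the candidates; the checks described above still decide which one, if any, is correct.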

When the author database was made electronic in 1985, no attempt was made to incorporate the information associated with pre-1985 items. From the point of view of a paper publication vehicle, the gain in expedited author identification was not enough to justify the expense. The first publication in 1996 of the MathSciNet web interface to the Mathematical Reviews Database changed everything. The ease with which authors could be searched electronically provided an impetus to create additional author-individual database records from items indexed or reviewed in Mathematical Reviews prior to 1985. One of the truisms of electronic tools of all kinds is that when users see something they like, they want more of it. The ease of searching authors of post-1985 items produced a demand for that same ease in searching authors of pre-1985 items.

To bring in the pre-1985 author data, a one-time matching algorithm was run. It assigned attribution of the older items to author-individuals already in the MR Database, based on exact string matches with known variations of an author's name. The decision was made to err on the side of "splitting" personalities, rather than "collapsing" personalities. If no matching name could be found in the MR Database, a new author-individual record was created. The magnitude of the task made handwork prohibitively expensive. Needless to say, this process could not be perfect, if only because publishers provided less useful identifying information in the "old days." For example, in cases where an author used one form of her or his name on items written prior to 1985 (say "Smith, J. M."), but a different form of the name on later items ("Smith, John M."), a new author-individual record for "Smith, J. M." was created in the MR Database. Most of the work done for the earlier printed author indexes was not carried over to the electronic author database. Nor was all the information on the 3x5 cards.
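The "Smith" case can be reproduced with a small Python sketch of exact-string attribution. It is a simplification under stated assumptions rather than the program that was actually run: the record layout and item identifiers are hypothetical, and the sketch sidesteps the case where the same string is a known variant of several author-individuals.

```python
def attribute_old_items(old_items, records):
    """Attach each older item to the record holding its exact name string,
    creating a new record when no known variant matches ("splitting")."""
    for item_id, published_name in old_items:
        match = next((r for r in records if published_name in r["variants"]), None)
        if match is None:
            # No exact match: open a new author-individual record
            match = {"preferred_name": published_name,
                     "variants": {published_name}, "items": []}
            records.append(match)
        match["items"].append(item_id)
    return records

# Invented sample data, for illustration only
records = [{"preferred_name": "Smith, John M.",
            "variants": {"Smith, John M."}, "items": ["post-1"]}]
old_items = [("pre-1", "Smith, J. M."),
             ("pre-2", "Smith, John M."),
             ("pre-3", "Smith, J. M.")]

attribute_old_items(old_items, records)
for r in records:
    print(r["preferred_name"], r["items"])
```

The two "Smith, J. M." items end up in a separate record even if they belong to the same person as "Smith, John M.", which is exactly the kind of duplicate record this process left behind.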

The nature of the work on the pre-1985 data produced anomalies in Author Search of which serious users of MathSciNet should be aware. Most of these are multiple author records corresponding to the same individual, but there are also cases of one "MR individual" really being two or more individuals.

Tools are available to MR staff to examine and combine author-individual records and to make these changes available on MathSciNet virtually overnight. MathSciNet users have also proved to be of invaluable assistance in this endeavor, and users are encouraged to notify Mathematical Reviews of any anomalies they notice in the author information presented in "Search Author Database" searches or when clicking on an author name in a headline. There are about 250 such communications from MathSciNet users each year.

The results of all this work can be accessed in MathSciNet in several ways. The least desirable way to search for an author is to use the "Full Search" or "Basic Search" tools. These searches are basically string searches on the published name string of a paper (as keyed at Mathematical Reviews) and the preferred name, if it is different. Thus, for example, a search in the "Author" field of a "Basic Search" for "Ross, Kenneth" would not return any items published under "Ross, K. A." Although there are items in the MR Database for which the published author name is "Ross, K. A.", the preferred name of that author is "Ross, Kenneth Allen". A search for "Ross, Kenneth A*" would return the items for "Ross, K. A.", but it would also return the items for "Ross, Kenneth Andrew".

[Matches for: ross, kenneth a]
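The behavior of the "Ross" searches can be imitated with a short Python sketch. It assumes, for illustration only, that the author field is matched as a whole string against the published name and the preferred name, with "*" as a wildcard; it uses `fnmatch.fnmatchcase` from the standard library, and the item identifiers and sample items are invented. MathSciNet's actual search implementation is not described here.

```python
from fnmatch import fnmatchcase

# Invented sample items, for illustration only
ITEMS = [
    {"id": "item-1", "published": "Ross, K. A.", "preferred": "Ross, Kenneth Allen"},
    {"id": "item-2", "published": "Ross, Kenneth A.", "preferred": "Ross, Kenneth Andrew"},
    {"id": "item-3", "published": "Ross, Kenneth", "preferred": "Ross, Kenneth"},
]

def basic_author_search(pattern, items):
    """Match the pattern against the published and the preferred name strings."""
    p = pattern.lower()
    return [
        it["id"] for it in items
        if fnmatchcase(it["published"].lower(), p)
        or fnmatchcase(it["preferred"].lower(), p)
    ]

print(basic_author_search("Ross, Kenneth", ITEMS))
print(basic_author_search("Ross, Kenneth A*", ITEMS))
```

Without the wildcard, the item published as "Ross, K. A." is missed; with it, that item is found through its preferred name, but so is the item belonging to a different author.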

In most cases, the best way to search for an author is to use the "Search Author Database" tool of MathSciNet. Searches using the "Search Author Database" tool return lists of author-individuals rather than lists of papers. At that point, a selection can be made among people, rather than among papers. The listed variants of the author name may be a quick way to distinguish between the authors returned by the search.

On the other hand, this is not much help in the "Wang, Wei" example. In this case, it is best to use the "Full Search" tool, with "Wang, Wei" in the author field and some additional information: e.g., a title fragment, a Mathematics Subject Classification, or a year of publication. Once the "Wang, Wei" desired is identified in a particular item, clicking on the author name in that headline or full item will return the headline list for all the items connected to that particular "Wang, Wei." For an author whose publication record crosses the 1985 boundary, the last statement may not be accurate. The authors of the pre-1985 items can be thought of as a collection of author-individual "maybes". An author search will present a list of all the possible "maybes". Of course, some of these "maybes" may, in fact, be the same person.
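The two-step approach for "Wang, Wei" can be sketched in Python as follows. The data, the `author_id` field, and both function names are hypothetical stand-ins for the "Full Search" tool and for clicking an author name; they are not MathSciNet interfaces.

```python
# Invented sample items, for illustration only
ITEMS = [
    {"id": "item-1", "author_id": 101, "author": "Wang, Wei",
     "year": 1998, "title": "On elliptic equations"},
    {"id": "item-2", "author_id": 102, "author": "Wang, Wei",
     "year": 1998, "title": "Graph colorings"},
    {"id": "item-3", "author_id": 101, "author": "Wang, Wei",
     "year": 2000, "title": "Parabolic estimates"},
]

def full_search(items, author, year=None, title_fragment=None):
    """Filter by published author string plus optional extra criteria."""
    hits = [it for it in items if it["author"] == author]
    if year is not None:
        hits = [it for it in hits if it["year"] == year]
    if title_fragment is not None:
        hits = [it for it in hits if title_fragment.lower() in it["title"].lower()]
    return hits

def items_for_author(items, author_id):
    """Emulate clicking the author name: all items tied to one author-individual."""
    return [it["id"] for it in items if it["author_id"] == author_id]

# Step 1: narrow the search until the desired "Wang, Wei" appears on one item
narrowed = full_search(ITEMS, "Wang, Wei", year=1998, title_fragment="elliptic")
chosen = narrowed[0]
# Step 2: follow that item's author-individual to the full list of items
print(items_for_author(ITEMS, chosen["author_id"]))
```

The name string matches all three items, while the author-individual identifier behind the chosen item picks out only the two that belong to the same person.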

Librarians are in many cases in a better position than mathematicians to fully understand the nuances explained here, and we hope that they can take a leadership role in educating the user community. A helpful discussion about author identification, as it is presented in the MathSciNet interface to the Mathematical Reviews Database, can be found in the booklet "MathSciNet--Mathematical Reviews on the Web: Guiding you through the literature of mathematics". Download a copy in PDF from http://www.ams.org/msnhtml/guidebook.pdf. A printed copy may be obtained free of charge at the AMS Bookstore, at {http://www.ams.org/bookstore-getitem/item=MRBK}. We encourage librarians to order as many copies as they think they might be able to distribute.

[Book cover]

The unique identification of authors is one of the distinguishing features of the Mathematical Reviews Database and its publication in MathSciNet. The data collected since 1940 is an invaluable resource for the mathematical research community and the information professional community. We hope that this article can help to build wider awareness of what the identification of authors involves and how MathSciNet users can take advantage of it.
