5  Classifying research

5.1 Introduction

Research is not a monolith. What we might call the global research community is in fact divided into several communities and sub-communities that operate as more or less independent fields, in the Bourdieusian sense. Terms like “field”, “research area”, “domain”, “discipline”, “specialty”, and “subject category” are roughly synonymous and refer to categories of knowledge. In this chapter, we use the term “discipline” to refer to such groups. We discuss their importance for bibliometrics, the challenges associated with delineating them, and the difficulty of identifying which discipline(s) an entity (e.g., publication, researcher, journal) belongs to.

What are disciplines?

The following quote highlights a few key points about disciplines and what defines them:

“Disciplines are defined (in part) and recognized by the academic journals in which research is published, and the learned societies and academic departments or faculties within colleges and universities to which their practitioners belong” (Wikipedia)

5.2 Classification(s)

A classification is essentially a list of categories in which entities can be classified (what disciplines exist). Many classifications have been developed by organizations worldwide, and the best classification to use is often determined by availability as well as the subject and goals of the analysis. Below are a few examples of general classifications (that cover all disciplines) and disciplinary classifications (that cover single disciplines with a higher level of granularity than the general classifications).

5.2.1 General classifications

  • The Science-Metrix classification of research outputs categorizes scientific journals and articles into 5 domains, 20 fields and 174 subfields. The classification can be downloaded here.

  • The Scopus All Science Journal Classification (ASJC) divides journals into 334 fields and 4 research areas. The list of ASJC fields can be found here.

  • The National Science Foundation (NSF) classification is a tried and tested mutually exclusive classification used in the Science & Engineering Indicators since the 1970s. It was originally designed by CHI Research (Archambault, Beauchesne, and Caruso 2011). It contains three layers: 2 domains, 14 fields, and 143 specialties.

  • The Web of Science Subject Categories are a non-exclusive, journal-level classification with 250 subject categories, available in the Web of Science. More information can be found here.

  • The Field of Science and Technology (FOS) classification of the OECD has 40 FOS grouped in six broad fields. Details can be found here. The Web of Science provides a mapping of the FOS and the Web of Science subject categories here.

  • The fields of research (FOR) from the Australian and New Zealand Standard Research Classification (ANZSRC). The FOR have three levels: divisions, groups, and fields. You can find more details on the ANZSRC website. This classification is used by the Dimensions database to assign FOR to articles and to calculate the field citation ratio (FCR), which we will discuss again in chapter 7.

5.2.2 Disciplinary classifications

Some disciplines have developed their own classification. Here are some examples.

  • The Medical Subject Headings (MeSH) are used in PubMed and Medline databases to facilitate searching (details here).

  • The Mathematics Subject Classification (MSC) (details here).

  • The Journal of Economic Literature (JEL) classification in economics (details here).

5.2.3 Computational (bottom up) approaches

There exists a variety of computational approaches to divide any set of entities into groups that can (but do not always) make sense. Topic models are an example. That said, topic models are not that popular in bibliometric studies, for which researchers tend to adopt citation-based network approaches (discussed in more detail below).

5.3 Classifying

Assigning one or more disciplines to documents or other entities can be a challenging task. Depending on the objective of your analysis, it may be preferable to use the classification already available in the database you are using (or to choose a database that uses the classification that best suits your needs).

But how do we assign a discipline to another entity, like a researcher?

  • Do we use the discipline of their PhD?

  • Do we use the discipline of their current department or faculty?

  • Do we use the discipline classification of their articles or the journals in which they are published?

There is no right answer to these questions. Most often, our main guide will be data availability and quality. The discipline of the PhD, for instance, is rarely available other than on the researcher’s CV. Furthermore, not all bibliographic records include the authors’ departments, and when they do, they usually do not use a controlled vocabulary, so the same department can appear under different names or spellings, and it will sometimes take some web searching to figure out which discipline the department should be assigned to. Moreover, department names might not match your disciplinary classification, which makes pairing departments with disciplines challenging.

If the publications in our database are assigned to one or more disciplines, we can infer a researcher’s discipline from the papers they published or the journals in which they published. But what do we do when a researcher has five publications in physics, three in chemistry, and four in economics? Is that person a physicist, a chemist, an economist, or a mix?
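One way to make this question concrete is to compute each discipline's share of a researcher's papers and apply an explicit decision rule. The sketch below is only an illustration: the function names, the "Multidisciplinary" label, and the threshold values are assumptions, not an established standard.

```python
from collections import Counter

# Hypothetical publication record: one discipline label per paper.
paper_disciplines = ["Physics"] * 5 + ["Chemistry"] * 3 + ["Economics"] * 4


def discipline_profile(labels):
    """Return each discipline's share of a researcher's papers."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {discipline: n / total for discipline, n in counts.items()}


def assign_discipline(labels, threshold=0.5):
    """Return the dominant discipline if its share reaches the threshold
    (an arbitrary, analyst-chosen value), otherwise "Multidisciplinary"."""
    profile = discipline_profile(labels)
    top, share = max(profile.items(), key=lambda item: item[1])
    return top if share >= threshold else "Multidisciplinary"


profile = discipline_profile(paper_disciplines)
print(profile)
print(assign_discipline(paper_disciplines))       # majority rule (50%)
print(assign_discipline(paper_disciplines, 0.4))  # looser rule (40%)
```

With a 50% threshold this researcher has no dominant discipline, while a 40% threshold makes them a physicist. The answer depends on the rule we choose, which is why the rule should be stated explicitly in any analysis.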

5.4 Describing disciplines

How can we represent disciplines (topics of interest, journals, researchers) using bibliometrics? The “simple” answer is that we can describe a discipline by looking at the entities we put into that box. For example, to describe the management field, we can look at the articles or journals assigned to the management discipline through some classification mechanism. Then we can look at different entities associated with these articles, such as the terms used in the articles (as a proxy for topics), journals, researchers, institutions, and countries, to describe what the discipline is about, what its main journals are, and who the agents involved in it are. This can be done in two main ways: rankings and networks.

5.4.1 Rankings

Bibliometric data is usually asymmetrically distributed so that:

  • A minority of researchers publish the majority of papers (Lotka’s law)

  • A minority of journals publish the majority of works on a subject (Bradford’s law)

  • A minority of words used account for the majority of occurrences (Zipf’s law)

  • A minority of articles (of a researcher, a journal, a discipline, etc.) account for the majority of the citations in the whole (Larivière et al. 2016).

Because of this, disciplines or other areas of research are sometimes described by identifying their most frequent keywords, journals, authors, and institutions. Because these analyses are easily done with the web versions of databases like Web of Science and Scopus, they are very popular among scholars outside the bibliometrics field, who often use rankings to provide an empirical account of the main topics and actors in their area of expertise. These descriptions are, however, very limited and are ideally combined with complementary approaches such as networks.
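A ranking of this kind, and the concentration it reveals, can be sketched in a few lines. The records below are invented for illustration, and `top_share` is a hypothetical helper rather than a standard bibliometric function.

```python
from collections import Counter

# Hypothetical records: (author, journal) for each paper in a field.
papers = [
    ("Author A", "Journal X"), ("Author A", "Journal X"),
    ("Author A", "Journal Y"), ("Author A", "Journal X"),
    ("Author B", "Journal X"), ("Author B", "Journal Z"),
    ("Author C", "Journal X"), ("Author D", "Journal Y"),
    ("Author E", "Journal X"), ("Author F", "Journal W"),
]


def top_share(counts, fraction):
    """Share of all papers accounted for by the top `fraction` of entities."""
    ranked = sorted(counts.values(), reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


author_counts = Counter(author for author, _ in papers)
journal_counts = Counter(journal for _, journal in papers)

print(author_counts.most_common(3))   # most productive authors
print(journal_counts.most_common(3))  # most frequent journals
print(top_share(author_counts, 1 / 3))
```

In this toy dataset, the top third of authors (two out of six) account for 60% of the papers, which is the kind of skew the laws listed above describe.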

Note

Listing the most frequent words in the titles and abstracts of articles published in a field, without any other processing, will not produce a meaningful representation of the field's topics of interest, because of the presence of stop words (a, the, it, when, if) and other generic terms (research, data, results, analysis, etc.). It is therefore important to filter out those words so that the top words included in a table adequately reflect the core topics of the discipline one is trying to describe.
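The filtering step can be sketched as follows. The stop word and generic term lists here are deliberately short and purely illustrative, and the titles are made up; a real analysis would rely on longer, field-appropriate lists.

```python
import re
from collections import Counter

# Illustrative lists only (assumptions, not a standard vocabulary).
STOP_WORDS = {"a", "an", "the", "it", "when", "if", "of", "in", "on", "and", "for", "to"}
GENERIC_TERMS = {"research", "data", "results", "analysis", "study"}

# Hypothetical article titles.
titles = [
    "An analysis of vaccine hesitancy in the pandemic",
    "Vaccine uptake and the pandemic: results of a survey",
    "Research data on vaccine distribution",
]


def top_terms(texts, n=3):
    """Return the n most frequent words after removing stop words
    and generic terms."""
    excluded = STOP_WORDS | GENERIC_TERMS
    words = []
    for text in texts:
        words.extend(re.findall(r"[a-z0-9-]+", text.lower()))
    kept = [word for word in words if word not in excluded]
    return Counter(kept).most_common(n)


print(top_terms(titles, n=2))
```

Without the filter, words such as "the" and "of" would tie with "pandemic" in this ranking; with it, "vaccine" and "pandemic" come out on top.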

What is a topic?

Topics and terms are not equivalent: a topic is usually represented by a set of terms. For instance, the COVID-19 pandemic topic could be represented by terms like coronavirus, COVID, COVID-19, Omicron, etc. This is something to keep in mind as you dive into the data and try to determine what topics are of interest within a discipline or any set of publications.

Delineating a research topic

Now might be a good time to note that so far, we discussed mostly disciplines and sub-disciplines, which, as we just saw, can be characterized by the terms (or topics) most frequently found in the articles published in the discipline. But what if one is interested in analyzing all the research on a given topic, irrespective of the discipline?

This is where information-searching skills can be useful, because there is no better way to gather publications on a topic than querying the database with all the necessary keywords and filters to achieve the best possible recall and precision. Of course, there are multiple approaches to this, such as writing a very broad query that maximizes recall and then filtering out irrelevant publications, or writing a very specific query that maximizes precision, perhaps at the expense of recall.
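To illustrate the trade-off, the sketch below runs a simple term-based query over a handful of invented records and computes precision (the share of retrieved records that are relevant) and recall (the share of relevant records that are retrieved). The term set, the records, and their relevance judgments are all hypothetical.

```python
import re

# Hypothetical term set representing the COVID-19 topic.
TOPIC_TERMS = {"coronavirus", "covid", "covid-19", "omicron"}

# Hypothetical records: id -> (title, manually judged relevance).
records = {
    1: ("Omicron transmission in schools", True),
    2: ("Coronavirus spike protein structure", True),
    3: ("Long-term effects of SARS-CoV-2 infection", True),
    4: ("Coronavirus infections in cattle", False),
    5: ("Soil chemistry of alpine meadows", False),
}


def retrieve(records, terms):
    """Return the ids of records whose title contains at least one term."""
    hits = set()
    for rec_id, (title, _) in records.items():
        tokens = set(re.findall(r"[a-z0-9-]+", title.lower()))
        if tokens & terms:
            hits.add(rec_id)
    return hits


retrieved = retrieve(records, TOPIC_TERMS)
relevant = {rec_id for rec_id, (_, is_relevant) in records.items() if is_relevant}
true_positives = retrieved & relevant

precision = len(true_positives) / len(retrieved)
recall = len(true_positives) / len(relevant)
print(sorted(retrieved), precision, recall)
```

Here the query misses record 3, which uses a term absent from the set (lower recall), and retrieves record 4, which is about another coronavirus (lower precision). Adding or removing terms shifts the balance between the two.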

The point is that classifications are not useful when they do not include a category that represents the body of literature that one wants to study.

One fun thing about choosing a research topic as your object of study is that you can use disciplinary classification to analyze the diversity of disciplines that are interested in the topic.

5.4.2 Disciplines as networks

As we saw in Chapter 3, disciplines can be understood as networks of agents and research objects that are connected to one another to some degree. Different kinds of networks are typically used in bibliometrics:

5.4.2.1 Co-occurrence networks

Co-occurrence networks are undirected networks in which the edges are determined by the joint appearance of two entities in the same set. Typical types of co-occurrence networks include:

  • Term co-occurrence networks are constructed by counting the number of articles in which two terms appear together (co-occur). For example, if an article’s title, abstract, or keywords contain the terms “COVID-19” and “vaccine,” then the terms “COVID-19” and “vaccine” are linked.

  • Bibliographic coupling networks. Bibliographic coupling occurs when two articles contain a reference to the same article. For example, if articles B and C both cite article A, then B and C are linked.

  • Co-citation networks. A co-citation between two articles occurs when they are both cited in the same article. For example, if articles B and C are both cited by article A, then B and C are linked. One of the issues with co-citations is that they take time to accumulate, so networks of recent articles are less accurate since the citations have yet to form links between the articles.
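The three types of networks listed above can all be derived by counting pairs. The sketch below uses invented data and hypothetical function names; edge weights are simple counts, with no normalization.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: citing article -> set of cited articles.
references = {
    "B": {"A", "D"},
    "C": {"A", "D", "E"},
    "F": {"D", "E"},
}

# Hypothetical data: article -> set of terms found in its title or abstract.
article_terms = {
    "P1": {"covid-19", "vaccine"},
    "P2": {"covid-19", "vaccine", "omicron"},
}


def co_occurrence(groups):
    """Weight of edge (x, y) = number of groups containing both x and y."""
    edges = Counter()
    for group in groups:
        for pair in combinations(sorted(group), 2):
            edges[pair] += 1
    return edges


def bibliographic_coupling(refs):
    """Weight of edge (x, y) = number of references shared by x and y."""
    edges = Counter()
    for x, y in combinations(sorted(refs), 2):
        shared = len(refs[x] & refs[y])
        if shared:
            edges[(x, y)] = shared
    return edges


term_network = co_occurrence(article_terms.values())
co_citation_network = co_occurrence(references.values())
coupling_network = bibliographic_coupling(references)

print(term_network)
print(co_citation_network)
print(coupling_network)
```

Note that co-citation is simply co-occurrence within reference lists, so the same function serves for both terms and cited articles. Bibliographic coupling looks at the same data from the other side: it links the citing articles rather than the cited ones.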

5.4.2.2 Direct citation network

Direct citation networks are the most common form of directed network that one will come across in bibliometrics. In a directed network, the relationship between two nodes has a direction. When article A cites article B, a direct citation link from article A to article B is created. This is not the same as a direct citation link from article B to article A. In fact, it is extremely unlikely that A will cite B and that B will also cite A: articles usually cite other articles that have already been published, which means that the cited articles cannot cite the citing articles back.
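A direct citation network can be represented as a list of ordered pairs. The sketch below, built on invented data, counts how often each article is cited and checks for reciprocal links; the variable and function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical directed edges: (citing article, cited article).
citations = [("A", "B"), ("C", "B"), ("C", "A"), ("D", "C")]
edge_set = set(citations)


def times_cited(edges):
    """Number of incoming citation links (in-degree) for each cited article."""
    return Counter(cited for _, cited in edges)


# Pairs of articles that cite each other (expected to be rare or absent).
reciprocal = {(x, y) for x, y in edge_set if (y, x) in edge_set}

print(times_cited(citations))
print(("A", "B") in edge_set, ("B", "A") in edge_set)
print(reciprocal)
```

Because the edges are ordered, the link from A to B exists while the link from B to A does not, and the set of reciprocal links is empty in this example.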

5.5 References

Archambault, Eric, Olivier H Beauchesne, and Julie Caruso. 2011. “Towards a Multilingual, Comprehensive and Open Scientific Journal Ontology.” In, 66–77. Leiden, Netherlands: Noyons, B., Ngulube, P., & Leta, J. (Eds.).
Larivière, Vincent, Véronique Kiermer, Catriona J. MacCallum, Marcia McNutt, Mark Patterson, Bernd Pulverer, Sowmya Swaminathan, Stuart Taylor, and Stephen Curry. 2016. “A Simple Proposal for the Publication of Journal Citation Distributions.” http://dx.doi.org/10.1101/062109.