1 What is data science?
1.1 Learning objectives
- What is data science?
- How to draft a problem statement?
- Where to find data?
1.2 What is data science?
According to a short perspective paper by Blei and Smyth (2017), which I recommend reading, data science:
focuses on exploiting the modern deluge of data for prediction, exploration, understanding, and intervention. It emphasizes the value and necessity of approximation and simplification. It values effective communication of the results of a data analysis and of the understanding about the world that we glean from it. It prioritizes an understanding of the optimization algorithms and transparently managing the inevitable trade-off between accuracy and speed. It promotes domain-specific analyses, where data scientists and domain experts work together to balance appropriate assumptions with computationally efficient methods.
1.3 The data science process
Problem statement: The data science process is filled with decisions, and there is no better way to get lost and frustrated than to lack an adequate and shared understanding of the problem that needs to be solved or the knowledge gap that needs to be filled. Problems are all around us, but not all of them are good data science problems. Good data science problems are relevant (they have a clear purpose) and are solvable with available data. At the end of this step, you have a clear research objective and clear research questions.
Data collection involves a series of steps aimed at gathering all the data that you will need for your project, such as finding relevant data sources, importing the data, and assessing its suitability for the problem at hand. At the end of this step, you have at hand all the data pieces that you will need for your project.
Pre-processing (tidying) involves structuring your data in a format suitable for analysis, and cleaning your data to remove errors, duplicates, etc. At the end of this step, you have a data set that is ready to produce valid answers to your research questions.
Analysis is about describing, analyzing, and visualizing the data. At the end of this step, you have produced tables and graphs that are informative in the context of your problem statement.
Interpretation is about assigning meaning to the analyzed data and drawing conclusions from it. At the end of this step, you have answers to your research questions.
Communication is about transferring the new knowledge to its intended audience(s). At the end of this step, you have a clear, transparent, and effective report.
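The collection, pre-processing, and analysis steps above can be sketched in a few lines of code. This is only a minimal illustration, written in Python with the standard library (an assumption; the course itself uses R). The data set, the column names, and the function names `collect`, `preprocess`, and `analyze` are all made up for the example.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data (made up for illustration): one row per researcher.
# It contains one exact duplicate (r2) and one missing funding value (r4).
RAW_CSV = """researcher_id,field,funding,papers
r1,Physics,120000,14
r2,Physics,80000,9
r2,Physics,80000,9
r3,History,15000,3
r4,History,,2
"""


def collect(text):
    """Data collection: import the raw rows as a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def preprocess(rows):
    """Pre-processing: drop exact duplicates and rows with missing funding,
    then convert the numeric columns from text to numbers."""
    seen = set()
    clean = []
    for row in rows:
        key = tuple(row.values())
        if key in seen or row["funding"] == "":
            continue
        seen.add(key)
        clean.append({
            "researcher_id": row["researcher_id"],
            "field": row["field"],
            "funding": float(row["funding"]),
            "papers": int(row["papers"]),
        })
    return clean


def analyze(rows):
    """Analysis: average funding and average number of papers per field."""
    by_field = {}
    for row in rows:
        by_field.setdefault(row["field"], []).append(row)
    return {
        field: {
            "mean_funding": mean(r["funding"] for r in group),
            "mean_papers": mean(r["papers"] for r in group),
        }
        for field, group in by_field.items()
    }


raw = collect(RAW_CSV)
clean = preprocess(raw)
summary = analyze(clean)
for field, stats in summary.items():
    print(field, stats)
```

Dropping rows with missing values is only one possible cleaning decision. In a real project, such choices should be justified by the problem statement and reported transparently.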
1.4 Writing a problem statement
In the context of this course, the problem statement encompasses the identification of a problem/knowledge gap that has some relevance, the project objectives, and the research questions.
1.4.1 Context
The context introduces the issue that needs to be solved or the knowledge gap that needs to be filled (what is it?) and an explanation of its relevance (why does it matter?).
Here is an example:
In Quebec, academic research is supported by national and provincial research councils that select, after peer review, the individuals or teams that receive funding. The number of researchers that are able to receive research funds is constrained by the limited funds available and the size of the grants. Past research showed that 20% to 45% of Quebec’s researchers had no external funding between 1999 and 2006, while 10% of researchers accumulated between 50% and 80% of the available funds. While we know how funding is distributed, we do not know how optimal that distribution is for producing research output and impact. Optimizing our funding policies and programs could increase the production of scientific knowledge required to solve local, national and global issues.
1.4.2 Objectives
The objective, of course, is to fix the problem or fill the gap identified in the context. But here the goal is expressed as a research objective. It also provides information on the data set and other details that help delineate the project. Here is an example:
The goal of this project is to help determine the optimal distribution of national research funds by applying data science methods to analyze data on the research funding, output and impact of Quebec researchers over a period of 15 years (1998-2012).
1.4.3 Research questions
The research questions express the research objective as specific questions to answer. The best research questions tend to start with the words “how”, “why”, “what”, or “which”. Here is an example:
1) What is the relationship between the amount of research funding of individual researchers and their research output?
2) What is the relationship between the amount of research funding of individual researchers and their research impact?
3) How does the relationship between funding and research impact and output vary between research fields?
Note that, depending on your project, your questions may look quite different. For instance, in this example, research funding and field are selected as predictors based on past knowledge and theory, while research output and research impact are the predicted variables. You should always know what you are trying to predict, but perhaps you have a lot of potential predictors in your data and your goal is to identify which ones are good predictors. In this case, you could have a question that looks like:
What are the best predictors of X?
What matters most is that:
- Your questions are clear.
- Your questions can be answered with data.
- You actually provide an answer to the questions in your report.
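To make "can be answered with data" concrete, here is a minimal sketch of how a question about the relationship between funding and research output could be turned into a computation, both overall and within each research field. It is written in Python (an assumption; the course itself uses R). The records are made up for illustration, and the Pearson correlation coefficient is only one possible way to measure a relationship, not necessarily the method you would choose for a real project.

```python
from math import sqrt

# Made-up records for illustration only:
# (research field, funding in dollars, number of papers).
RECORDS = [
    ("Physics", 50000, 5),
    ("Physics", 100000, 9),
    ("Physics", 150000, 16),
    ("History", 10000, 4),
    ("History", 20000, 3),
    ("History", 30000, 5),
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_by_field(records):
    """Correlation between funding and output, computed within each field."""
    groups = {}
    for field, funding, papers in records:
        xs, ys = groups.setdefault(field, ([], []))
        xs.append(funding)
        ys.append(papers)
    return {field: pearson(xs, ys) for field, (xs, ys) in groups.items()}


# Funding vs. output across all researchers.
overall = pearson([r[1] for r in RECORDS], [r[2] for r in RECORDS])

# The same relationship, compared between research fields.
per_field = correlation_by_field(RECORDS)

print(f"overall: {overall:.2f}")
for field, r in per_field.items():
    print(f"{field}: {r:.2f}")
```

A question such as "What are the best predictors of X?" could be approached in a similar spirit, by computing such a measure for each candidate predictor and comparing the results.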
1.5 Getting data
If you are working for a company, your client or employer will most likely have an internal database storing information related to its activities (e.g., clients, products, inventories, sales, employees, financial performance), which you may use for your data science project.
1.5.1 Open government data
Governments and public organizations are also increasingly making the data they collect openly accessible for the benefit of the public. Here are some sources:
- Canada (https://open.canada.ca/en/open-data)
- Nova Scotia (https://data.novascotia.ca/)
- United States (https://www.data.gov/)
- World Bank (https://data.worldbank.org)
- Toronto Public Library (https://opendata.tpl.ca/)
1.5.2 Research data
The Open Science movement also encourages researchers to share the data behind their published work. Such data can be found in repositories such as:
- DataCite (https://datacite.org/). This is an aggregator that allows you to search hundreds of research data repositories.
- Zenodo (https://zenodo.org/).
- Figshare (https://figshare.com/).
1.5.3 Bibliographic records
Bibliographic records and other metadata related to different types of works can be used for data science projects. However, because of their enriched metadata and their inclusion of bibliometric indicators like citations, citation indices provide more opportunities. Examples of citation indices are:
- Scopus (available through the Dal libraries)
- OpenAlex. The search engine does not allow you to easily download data, but there is a free API that can be used quite easily in R with the openalexR package (we will learn how to use R and R packages in Chapter 2 and how to use an API in Chapter 3).
- Google Scholar. The easiest way to download data from Google Scholar is to use the Publish or Perish software.
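To give a rough idea of what querying the OpenAlex API involves, here is a minimal sketch in Python using only the standard library (an assumption; the course itself uses R and the openalexR package). It relies on the public `https://api.openalex.org/works` endpoint with its `search` and `per-page` parameters, and keeps three fields from each returned work. The helper names `build_url`, `summarize`, and `fetch_works` are hypothetical, and access rules for the API may change, so check the OpenAlex documentation before relying on this.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_url(search, per_page=5):
    """Build a query URL for the OpenAlex works endpoint."""
    params = {"search": search, "per-page": per_page}
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


def summarize(payload):
    """Keep a few fields from each work in a decoded OpenAlex response."""
    return [
        {
            "title": work.get("display_name"),
            "year": work.get("publication_year"),
            "citations": work.get("cited_by_count"),
        }
        for work in payload.get("results", [])
    ]


def fetch_works(search, per_page=5):
    """Send the request and return the summarized works."""
    with urlopen(build_url(search, per_page), timeout=30) as response:
        return summarize(json.load(response))


# Example call (requires internet access):
# for work in fetch_works("research funding"):
#     print(work)
```

The request and the parsing are kept in separate functions so that the parsing step can be checked on a saved response without contacting the server.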
1.5.4 Miscellaneous datasets
There is an overwhelming amount of data available on the Web, so here is a non-exhaustive list of data sources that you might find useful.
- Kaggle (https://www.kaggle.com)
- Awesome public datasets (https://github.com/awesomedata/awesome-public-datasets)
- Internet Movie Database (IMDb) (https://www.imdb.com/interfaces/)
Please note that you are free to use any data you wish for this course. The only restriction is that you must be able to share the data with your instructor.
1.6 Homework
Your homework is to start gathering ideas and data for your research project proposal. This includes:
- Thinking about a topic for your research project.
- Finding data sources that might be suitable for that topic.
- Starting a draft of your problem statement.