Abstract
According to popular belief, Big Data and Machine Learning provide a brand-new approach to science that has the potential to revolutionize scientific progress (Hey et al., 2009; Kitchin, 2014). The extreme version of this belief is illustrated by Anderson’s claim that Big Data and Machine Learning in science will lead to the “end of theory” (Anderson, 2008). The idea behind this extreme version of the belief is that advanced Big Data and Machine Learning algorithms
enable us to mine vast amounts of data related to a given problem without prior knowledge, and that we do not need to worry about causality, since correlation is all that is needed.
The extreme version of this belief is not seriously held by many philosophers of science, but
there are several serious attempts to determine the extent to which Big Data and Machine Learning imply a resurgence of inductive methods (Pietsch, 2021) or agnostic science (Napoletani et al., 2021). Without claiming that theory has "come to its end", these approaches advocate new scientific methods that can be applied across various fields in a similar way, without the need for domain-specific knowledge.
Two questions arise in connection with these views: Where did these ideas come from, and to what extent are they justified? The first question could be addressed by following the hype around Big Data and Machine Learning in industry and observing how easily conversations about new innovations and disruptions translate into conversations about paradigm shifts in science. The leakage of this style of argumentation, and the attraction to hype, from industry to science
should be carefully watched and frequently questioned. In these times of pressure to bridge the gap between business and science, it can be difficult to distinguish valuable insights and ideas from superficial buzz-talk. The case of Big Data and Machine Learning is one of the areas that shows how important it is to clearly map the flow of knowledge between business and science. We do not, however, claim to make scientific progress on this question; we only sketch our observations here.
Regarding our second question, we argue that using methods from Big Data and Machine Learning is not a passive process in which one feeds raw data into a Big Data or Machine Learning algorithm and waits for it to detect correlations between features of a massive dataset. We argue that there is always work in manipulating data, cleaning data, etc. that requires significant domain knowledge in scientific applications.
We agree that expert domain knowledge used in Big Data and Machine Learning is of a different kind from that required by traditional methods. The question we ask is about the amount of expert knowledge needed for Big Data and Machine Learning methods to function in an efficient manner. By carefully assessing examples of Big Data and Machine Learning
applications in science, such as skin cancer detection (Esteva et al., 2017), protein folding (Senior et al., 2020) and language generation (Brown et al., 2020), we assess which knowledge plays a role in each of these cases. We observe that significant domain knowledge is needed, not so much in theory and model building, but in data preparation and validation of Machine
Learning models. Data needs to be labelled appropriately, decisions need to be made about which data to include, and new features might need to be created. Furthermore, we need to know what constitutes a well-functioning Machine Learning algorithm for a given problem, what we should measure and what is a good enough value to constitute a solution to the problem.
In addition to expert knowledge about data samples, specific knowledge of Machine Learning is also often needed, as algorithms cannot be applied blindly in practice. A promising model architecture needs to be selected, appropriate data augmentation techniques need to be applied to increase the performance of the algorithm, and the algorithm needs to be tuned and adjusted to get good enough Machine Learning performance.
We do not intend to dismiss new scientific methods or new lines of research, but to disclose what work and knowledge the new methods require, and thus show in more detail what is new and what is business as usual. Big Data and Machine Learning methods may be used in a more agnostic way, but they do not lead to completely agnostic science. It is not a question of changing or revolutionizing science, but of expanding the methodological toolkit. This will not necessarily revolutionize all of science, but it could lead to changes in subfields and provoke the emergence of new fields or endeavors, such as digital humanities or computational social science.
In our talk, we will discuss concrete case studies (skin cancer detection, protein folding, language generation), where we present methods that are used and we highlight those moments where expert knowledge is involved. We will attempt to classify various aspects of expert knowledge involved in the application of Big Data and Machine Learning methods, for instance, the expert knowledge necessary at the training data sample preparation or the expert
knowledge necessary for choosing algorithms. We will also suggest that, depending on the field, the balance of traditional versus agnostic methods varies, leading us to believe that the process of "agnosticization" differs from field to field, and that the possibility of reaching the "no-theory" stage varies by domain. Consequently, we observe that the way in which
Big Data and Machine Learning methods enter scientific methodology involves continuous small conceptual shifts rather than a rigid paradigm shift in Kuhn's sense.
Original language | English |
---|---|
Publication date | Oct 2021 |
Number of pages | 2 |
Publication status | Published - Oct 2021 |
Event | 6th International Conference on the History and Philosophy of Computing, ETH Zürich, Zürich, Switzerland. Duration: 27 Oct 2021 → 29 Oct 2021. https://hapoc2021.sciencesconf.org/ |
Conference
Conference | 6th International Conference on the History and Philosophy of Computing |
---|---|
Location | ETH Zürich |
Country/Territory | Switzerland |
City | Zürich |
Period | 27/10/2021 → 29/10/2021 |
Internet address | https://hapoc2021.sciencesconf.org/ |