Data Requirements for Artificial Intelligence in Manufacturing

Ready to enter the world of Big Data? Before AI can effectively learn a model of a manufacturing process, it must be properly designed and fed enough examples of the right data, so that its algorithms can automatically learn complex patterns from a sample of that data. Here are some insights to help you prepare.

Posted: November 22, 2019

BY JORIS STORK

At the core of today’s state-of-the-art artificial intelligence (AI) algorithms is the ability to learn complex patterns from a sample of data. When considering AI, it’s important to understand the data requirements at the outset, because this data sample is a representation of the history of the factory process. In the manufacturing context, one type of pattern might be the ways in which a set of parameters contained in that data – typically between a few thousand and several million values that are related to a process in a factory – vary together. For example, if a trend exists in the sample to the effect that every one-degree increase in the process temperature tends to be accompanied by a ten-second decrease in process time, the AI will learn this apparent relationship between the temperature and time parameters. In this way, the AI effectively learns a model of the process. It does so automatically, provided that it is properly designed and fed enough examples of the right data.
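
To make the temperature and time example concrete, here is a minimal sketch of how such a trend could be recovered from historical records. It uses an ordinary least-squares line as a stand-in for the far richer models used in practice; the function name `fit_line` and the six data points are invented for illustration and are constructed so that each extra degree goes with ten fewer seconds.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: process temperature (degrees) and process time (seconds).
temperatures = [180.0, 181.0, 182.0, 183.0, 184.0, 185.0]
process_times = [600.0, 590.0, 580.0, 570.0, 560.0, 550.0]

slope, intercept = fit_line(temperatures, process_times)
print(f"Estimated change in process time per degree: {slope:.1f} s")
```

A real process history would be noisy and would involve many parameters at once, which is where a single fitted line stops being adequate.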

WHAT IS THE RIGHT DATA FOR AI?
What constitutes the ‘right’ data for AI-enabled process optimization? The general answer is the set of data that is sufficient to describe how changes to process parameters affect quality. The bulk of process data can generally be represented as a table, or a collection of tables, made up of columns (parameters) and rows (production examples, with, say, one production batch per row). To be meaningful as a representation of a process – or more specifically, of the history of a process – these tables need to be accompanied by some explanatory information. Before turning to the data requirements in terms of those tabular columns and rows, consider the three key pieces of explanatory information that are required:

A high-level description of the physical process.
A description of the flow of production through the process (normally in the form of a process flow diagram), including the time offsets between process steps in some contexts.
A description of how the data table(s) relate to the process.
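
The time offsets mentioned in the second item matter because a quality measurement usually has to be matched with the upstream conditions that applied when the same material passed through an earlier step. The sketch below shows one possible way to do that matching. The 30-minute offset, the 5-minute tolerance, the step names, and the function `align_steps` are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def align_steps(upstream, downstream, offset, tolerance=timedelta(minutes=5)):
    """Pair each downstream record with the upstream record made `offset` earlier.

    Both inputs are lists of (timestamp, value) tuples. Downstream records with
    no upstream record within `tolerance` of the expected time are skipped.
    """
    pairs = []
    for d_time, d_value in downstream:
        target = d_time - offset
        # Pick the upstream record closest in time to the expected moment.
        u_time, u_value = min(upstream, key=lambda rec: abs(rec[0] - target))
        if abs(u_time - target) <= tolerance:
            pairs.append((u_value, d_value))
    return pairs


# Hypothetical records: furnace temperature upstream, quality score downstream.
day = datetime(2019, 11, 22)
furnace_temps = [(day.replace(hour=8), 181.0), (day.replace(hour=9), 184.0)]
quality_checks = [
    (day.replace(hour=8, minute=31), 0.92),
    (day.replace(hour=9, minute=29), 0.88),
    (day.replace(hour=11), 0.90),  # no furnace reading near 10:30, so skipped
]

aligned = align_steps(furnace_temps, quality_checks, offset=timedelta(minutes=30))
print(aligned)
```

In practice the offset itself comes from the process flow diagram or from the factory specialists, which is why that explanatory information has to accompany the tables.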

Some of these descriptions can be obtained from available technical documentation. In most cases, however, the necessary insights can be learned by walking through the data tables with specialists from the factory or process equipment. Due to the nature of AI-enabled parameter optimization, there are some clear fundamentals that the bulk of the data – the data tables – needs to satisfy. Let’s outline these fundamentals, first in terms of the data columns and then in terms of the row-wise requirements.

DATA COLUMNS: A REPRESENTATION OF QUALITY
The data columns must first include a representation of the quality result. Note that the data might not contain a full representation of how quality is measured in the factory. Such gaps in the data are common (for example, those created by batch sampling). Even so, in some cases the available data can be sufficient to achieve dramatic results, as demonstrated in the parameter optimization below. The second set of required data columns concerns process parameters. These fall into two types:

Controllable parameters are the ‘levers’ available to the factory operator to alter the process and thus to improve quality. In general terms, these could include controllable aspects of the process, such as spindle speed and depth of cut, or chemistry, temperature, and time.
Non-controllable parameters represent inputs to the process that cannot be controlled by the plant operator from day to day, such as the ambient temperature, the identity of the machine tool (in the case of a parallel process), or characteristics of the input material.
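
One way to record these column roles alongside a data table is a small schema, sketched below. The column names, units, and the helper `check_schema` are hypothetical; the only idea taken from the text is that every column is a quality result, a controllable parameter, or a non-controllable parameter.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    QUALITY = "quality"
    CONTROLLABLE = "controllable"
    NON_CONTROLLABLE = "non_controllable"


@dataclass(frozen=True)
class Column:
    name: str
    role: Role
    unit: str


# Hypothetical schema for a machining process table (one row per batch).
SCHEMA = [
    Column("scrap_rate", Role.QUALITY, "fraction"),
    Column("spindle_speed", Role.CONTROLLABLE, "rpm"),
    Column("depth_of_cut", Role.CONTROLLABLE, "mm"),
    Column("ambient_temperature", Role.NON_CONTROLLABLE, "degC"),
    Column("machine_id", Role.NON_CONTROLLABLE, "label"),
]


def columns_with_role(schema, role):
    """Names of all columns that play the given role."""
    return [column.name for column in schema if column.role is role]


def check_schema(schema):
    """Return a list of problems; an empty list means the basic roles are covered."""
    problems = []
    if not columns_with_role(schema, Role.QUALITY):
        problems.append("no quality column: nothing to optimize against")
    if not columns_with_role(schema, Role.CONTROLLABLE):
        problems.append("no controllable column: no levers to adjust")
    return problems


print(columns_with_role(SCHEMA, Role.CONTROLLABLE))
print(check_schema(SCHEMA))
```

Keeping the roles explicit makes it clear which columns an optimizer may change and which it can only observe.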

Together, these parameter columns should represent the factors that have the greatest influence on quality. However, because AI models can learn complex interactions among a large number of variables, a manufacturer is best advised to make all data points around the process available for inclusion in the AI model. The cost of including additional variables is low. A good AI specialist will employ the necessary statistical techniques to determine whether each variable should be included in the final model. Variables that might be considered marginal at first may contribute to an AI model that leverages effects and interactions in the process that the specialists were previously unaware of, potentially resulting in an improved optimization outcome.
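
The article does not say which statistical techniques a specialist would use. As a deliberately simple stand-in, the sketch below ranks candidate variables by the absolute Pearson correlation between each column and the quality column. The table contents and the function names are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear signal
    return sxy / sqrt(sxx * syy)


def rank_variables(table, quality_column):
    """Rank every non-quality column by |correlation| with quality, strongest first."""
    quality = table[quality_column]
    scores = {
        name: abs(pearson(values, quality))
        for name, values in table.items()
        if name != quality_column
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical table: one list per column, one entry per production batch.
table = {
    "temperature": [180.0, 181.0, 182.0, 183.0, 184.0],
    "humidity": [40.0, 42.0, 39.0, 41.0, 40.0],
    "shift_code": [1.0, 1.0, 1.0, 1.0, 1.0],
    "defect_rate": [0.05, 0.045, 0.04, 0.035, 0.03],
}

ranking = rank_variables(table, "defect_rate")
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A pairwise screen like this only sees linear, one-variable-at-a-time effects. It would miss exactly the interactions the paragraph above describes, which is one reason not to discard low-ranked variables too early.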

ROW-WISE DATA REQUIREMENTS
The general rule with row-wise data requirements is that the data must be representative of the process and, in particular, of the interactions that are likely to affect quality in the future. A basic aspect of this is to ask: How many rows, i.e. production examples, make a sufficient training set? The answer depends on the complexity of the process. The sample needs to be a sufficient representation of this complexity. In the manufacturing context, the lower bound typically ranges from a few hundred to several thousand historical examples. Training a model on more data than is strictly sufficient tends to increase the model’s confidence and level of detail – which, in turn, is likely to further improve the optimization outcome.

A sufficient number of historical examples does not in itself guarantee a representative sample. The historical examples should also be representative with respect to time. The dataset should be sufficiently recent to represent the likely operating conditions – like machine wear – at the time of optimization.
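
The row-count and recency requirements can be turned into a simple pre-training check, sketched below. The thresholds (500 rows, 365 days, half of the rows recent) are placeholder assumptions rather than recommendations from the article; suitable values depend on the complexity of the process.

```python
from datetime import date, timedelta


def check_rows(batch_dates, today, min_rows=500, max_age_days=365,
               min_recent_fraction=0.5):
    """Return a list of problems with the training rows; empty means none found.

    All thresholds are illustrative defaults, not recommendations.
    """
    problems = []
    if len(batch_dates) < min_rows:
        problems.append(f"only {len(batch_dates)} rows, expected at least {min_rows}")
    cutoff = today - timedelta(days=max_age_days)
    recent = sum(1 for d in batch_dates if d >= cutoff)
    fraction = recent / len(batch_dates) if batch_dates else 0.0
    if fraction < min_recent_fraction:
        problems.append(f"only {fraction:.0%} of rows are newer than {cutoff}")
    return problems


# Hypothetical history: one batch per day for the 600 days up to 1 November 2019.
today = date(2019, 11, 22)
batch_dates = [date(2019, 11, 1) - timedelta(days=i) for i in range(600)]

print(check_rows(batch_dates, today))        # enough rows, mostly recent
print(check_rows(batch_dates[:100], today))  # recent, but too few rows
```

A check like this catches only the crudest failures. Whether the rows cover the operating regions that matter still needs a conversation with the process specialists.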

In many cases the data should also represent one or more sufficient periods of continuous operation, as this allows the AI to learn which operating regions can be sustained, as well as how effects from one part of the process propagate to others over time.
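
Whether a dataset contains sufficient periods of continuous operation can be checked by splitting the record timestamps into runs wherever the gap between consecutive records is too large. The sketch below assumes hourly records and a two-hour maximum gap; both are illustrative choices, as is the function name `continuous_runs`.

```python
from datetime import datetime, timedelta


def continuous_runs(timestamps, max_gap):
    """Group timestamps into runs whose consecutive records are at most max_gap apart."""
    runs = []
    for ts in sorted(timestamps):
        if runs and ts - runs[-1][-1] <= max_gap:
            runs[-1].append(ts)  # continues the current run
        else:
            runs.append([ts])    # gap too large (or first record): start a new run
    return runs


# Hypothetical hourly records with a stoppage between 09:00 and 16:00.
start = datetime(2019, 11, 1, 6)
stamps = [start + timedelta(hours=h) for h in (0, 1, 2, 3, 10, 11)]

runs = continuous_runs(stamps, max_gap=timedelta(hours=2))
longest = max(runs, key=len)
print(f"{len(runs)} runs, longest covers {longest[-1] - longest[0]}")
```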

CONSISTENCY AND CONTINUED DATA AVAILABILITY
The last key data requirement is consistency and continued availability. In order to keep the AI model current with operating conditions on the production line, fresh data must be available for regular retrains of the model. This, in turn, requires some level of integration with the data source. In a worst-case scenario, this might mean a continuous digitization process if the record-keeping system is offline, or manual exports of tabular data by factory technicians. These approaches are relatively labor-intensive and may be subject to inconsistencies. An ideal setup would consist of a live data stream from the manufacturer’s data bus into a persistent store dedicated to supplying the AI training pipeline. For some manufacturers, a mixture of approaches is appropriate to cater for multiple plants.

Continued data availability goes hand in hand with the requirement for data consistency. This is best illustrated with a negative example, in which a factory intermittently changes the representation of variables in data exports, such as whether a three-state indicator is represented as a number from the set {1, 2, 3} or as a string of text from the set {‘red’, ‘orange’, ‘green’}. If uncaught, these types of changes could quietly corrupt the optimization model and potentially result in a negative impact on process quality. The digitization and automation of process data infrastructure and data exports go a long way towards addressing these issues. Whatever the factory’s data infrastructure, however, a good AI ingest pipeline should feature a robust data validation layer to ensure that inconsistencies are flagged and fixed.
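
As a minimal sketch of the flagging half of such a validation layer, the check below reports every value in a column that falls outside the agreed representation. It assumes the text form of the three-state indicator has been chosen as canonical. The function name and sample export are hypothetical, and the check only flags values: mapping a stray number back to a color would need a mapping confirmed by the factory.

```python
# Assumed canonical representation of the three-state indicator.
ALLOWED_STATES = frozenset({"red", "orange", "green"})


def find_invalid(values, allowed):
    """Return (row_index, value) pairs whose value is outside the allowed set."""
    return [(i, v) for i, v in enumerate(values) if v not in allowed]


# Hypothetical export in which the representation changed partway through.
indicator_column = ["green", "green", 3, "orange", "1"]

invalid = find_invalid(indicator_column, ALLOWED_STATES)
for row, value in invalid:
    print(f"row {row}: unexpected value {value!r} ({type(value).__name__})")
```

Reporting the offending rows and types, rather than silently coercing them, keeps the decision about how to fix the data with someone who knows what the export was meant to contain.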

Joris Stork

Joris Stork is a senior data scientist at DataProphet (Pty) Ltd, 109A 2nd Floor, The Foundry Building, 74 Prestwich Street, Green Point, Cape Town, South Africa 8005, +27 21 300 3555, www.dataprophet.com. Send an email to: [email protected].