Oskar Siljama

Reliable defect detection using machine learning for ultrasonic inspection of nuclear power plant welds

Master’s Thesis

Espoo 06.09.2020

Supervisor: Iikka Virkkunen

Instructor: Tuomas Koskinen


Aalto University, P.O. BOX 11000, 00076 AALTO

www.aalto.fi

Abstract of master's thesis

Author Oskar Siljama

Title of thesis Reliable defect detection using machine learning for ultrasonic inspection of nuclear power plant welds

Master programme Mechanical Engineering Code MEC

Thesis supervisor Iikka Virkkunen

Thesis advisor(s) Tuomas Koskinen

Date 06.09.2020 Number of pages 49 Language English

Abstract Ultrasonic testing of austenitic stainless steel welds is tedious due to the complex material structure, which causes considerable noise and strongly attenuates sound. State-of-the-art phased array systems are used to improve the reliability of inspection by recording richer data sets with a higher signal-to-noise ratio (SNR). However, this further burdens the inspector because of the large amount of data to examine. Convolutional neural networks have been successful in recent years as a result of the development of large-scale data sets. These data sets have allowed models to obtain high accuracy in demanding tasks that were earlier considered infeasible. Image classification networks have been successfully employed to analyse ultrasonic B-scans and compared to human inspectors, showing superhuman performance but with only a few different flaws available. In these studies, data augmentation has been utilised to obtain data sets of feasible size. It is of interest to apply image classification algorithms to more complex ultrasonic data with an increased number of flaws.

The purpose of this thesis is to construct a convolutional neural network to classify virtually embedded flaws from multichannel B-scans obtained from austenitic stainless steel weld inspection. These welds simulate critical components of nuclear power plants. Moreover, the research includes verification of the model by predicting out-of-sample scans of a previously unseen weld canvas to assess the reliability of inspection.

The results of this thesis show that 1) austenitic stainless steel weld B-scans have very high noise content that can conceal the embedded flaw signals; 2) the model can obtain good performance in image classification of multichannel austenitic stainless steel weld B-scans, thus displaying high reliability; 3) the key to employing machine learning in ultrasonic inspection is large-scale data sets composed with sophisticated data augmentation, because flawed samples are scarce, which is typical for such inspection; 4) the model is sensitive to the flaw sizes used in training: by excluding small flaws from training, the model learns patterns more efficiently without large false call rates.

Keywords Phased array, Ultrasonic testing, Machine learning, Image classification


Contents

1 Introduction
2 Ultrasonic inspection of austenitic stainless steel welds
2.1 Fundamental principles of ultrasonic inspection
2.2 Acoustic impedance and attenuation in flaw detection
2.3 Introduction to phased array ultrasonic inspection
2.4 Advanced phased array ultrasonic inspection
2.5 Ultrasonic inspection of austenitic stainless steel welds
2.6 Analysing ultrasonic signals
2.7 Evaluating the performance of ultrasonic inspections
2.8 Ultrasonic testing in the scope of machine learning
3 Machine learning and image classification
3.1 Machine learning models
3.2 Forward propagation
3.3 Backward propagation
3.4 Overfitting and underfitting
3.5 Distributing the data
3.6 Data augmentation
3.7 Batch normalisation
3.8 Densely connected neural network
3.9 Deep convolutional neural network
3.10 Image classification and object detection
3.11 Image classification in NDT
4 Assessing phased array data with machine learning
4.1 Data acquisition and augmentation
4.2 Preprocessing the data
4.3 Machine learning model
5 Results
5.1 Phased array inspection results
5.2 Machine learning results
6 Discussion
7 Limitations
8 Conclusion
Acknowledgements
References


Table of Figures

Figure 1 Acoustic waves.
Figure 2 Schematic representation of ultrasonic inspection.
Figure 3 Ultrasonic B-scan.
Figure 4 Simplified phased array probe.
Figure 5 Simplified example of steered ultrasonic wave.
Figure 6 Angular probe emitting both LW and SW to an anisotropic material.
Figure 7 General machine learning algorithms.
Figure 8 Forward and backward pass.
Figure 9 Double descent risk curve.
Figure 10 Curve fitting.
Figure 11 Data set is divided into three different sets.
Figure 12 MNIST handwritten digit convoluted through single kernel.
Figure 13 Architecture of VGG16.
Figure 14 Preprocessing of ultrasonic data.
Figure 15 The effect of flaw sizes in training the ML model.
Figure 16 Used model to train the classifier.
Figure 17 Effect of noise in B-scans.
Figure 18 POD results.


Table of Tables

Table 1 Inspection samples.
Table 2 True dimensions of each thermal fatigue crack.


List of Abbreviations

Adam Adaptive moment estimation

DCNN Deep convolutional neural network

DL Deep learning

DMA Dual matrix array

LW Longitudinal wave

ML Machine learning

NDT Non-destructive testing

PAUT Phased array ultrasonic testing

POD Probability of detection

ReLU Rectified linear unit

SGD Stochastic gradient descent

SNR Signal-to-noise ratio

SW Shear wave

TRL Transmit-receive longitudinal

TRS Transmit-receive shear

UT Ultrasonic testing


1 Introduction

The trend of ultrasonic testing (UT) in non-destructive testing (NDT) is towards richer data acquisition to increase the reliability of inspections. Better data acquisition is achieved by recording more data on the environment using state-of-the-art equipment and methods. Thanks to methods such as Full Matrix Capture, improved imaging can be obtained from a high number of A-scans computed from each transmit-receive combination of the phased array probe, thus providing more data at a single scan location. As a result, data set sizes have multiplied, and the data sets also contain more redundant data.

Inspection of highly complex austenitic stainless steel welds is still conducted manually or with mechanised scanning by human inspectors, including the data processing. The reason is that automated systems, which flag flaws when a certain limit is surpassed, have been unable to assess such data. So, with the advancement of data acquisition, the workload of going through the data increases tremendously. Moreover, most of the data is flaw-free, which may increase the probability of false calls. Therefore, assessing the structural integrity of these components can be tedious. Thus, novel data analysis methods should be considered to alleviate the human inspector’s burden and to increase the reliability of inspection.

Convolutional networks in machine learning (ML) have experienced great success in recent years in image classification and object detection tasks. What has enabled the use of these highly complex deep learning networks is the large-scale data sets that result from improved data acquisition methods. Moreover, developments in computational resources such as graphics processing units and tensor processing units have eased the operation of deep neural networks. As a result, training networks has become much faster, and the speed is continuously improving. Therefore, it is of great interest to apply these tools to the practical application of UT, where going through results resembles image classification. In addition, neural networks require less feature engineering because they can learn features from the data.

For a human, going through the data set from a state-of-the-art phased array inspection is tedious. ML could take full benefit of the increased data set by also taking advantage of the empty welds to identify flaw-free images. A problem often occurs when the data set is imbalanced so that there is a clear majority and minority class. Data set imbalance is common in ultrasonic inspections, where flawed samples are often scarce. Limited data sets have also been a problem in ensuring the quality of inspection in the qualification processes of human inspectors, where a certain number of flaws is required to capture the variation. Moreover, an ML model requires even more data than a human inspector to obtain comparable performance. Data augmentation is employed to increase the size of the data set by introducing transformations to the original data, mitigating problems with limited data sets.

Earlier studies have demonstrated excellent classification performance on ultrasonic A-scans consisting exclusively of flawed samples (Munir et al., 2019). Image classification algorithms have also been employed in flaw detection of ultrasonic B-scans (Virkkunen et al., 2019). However, the images in these studies have been single-channel scans with only a few different flaws embedded. Therefore, it is of significant interest to apply state-of-the-art image classification techniques to highly noisy multichannel B-scans obtained from austenitic stainless steel welds, which are standard components throughout nuclear power plants.


This study aims to construct a state-of-the-art convolutional neural network and verify its performance in the analysis of multichannel B-scans of 316L austenitic stainless steel welds. These scans are obtained from dual matrix phased array UT to ensure maximal capture of information. Successful verification of model performance is judged by assessing the false call rate and the probability of detection based on the prediction of out-of-sample data. Moreover, no misses of large flaws should be observed.

The scope of this study is limited to the phased array ultrasonic NDT method and to convolutional neural networks. The basics of UT and ML are included in the theory to support the learning of these more advanced methods. Other topics within UT, such as time-of-flight diffraction and electromagnetic acoustic transducers, are excluded. Moreover, only A- and B-scans are processed, so other imaging techniques are not relevant. In ML, recurrent neural networks and image segmentation are excluded; the focus is on image and B-scan classification using convolutional networks, but features of some other approaches are discussed, such as object detection, which has high potential in the field.

The thesis is structured as follows. Chapter 2 reviews the fundamentals of ultrasonic inspection and builds up to state-of-the-art phased array ultrasonic inspection. The knowledge of the phased array is utilised to comprehend the inputs provided to the ML model. Chapter 3 outlines the fundamentals of ML and deep learning and reviews previous applications of ML in NDT. Several fundamental mathematical representations and formulas are provided that define the operations used in the constructed neural networks. Chapter 4 describes the approach used for data acquisition, materials, preprocessing and the ML model architecture in this study. Chapter 5 provides the results, including the performance of the model and POD. Chapter 6 analyses the results, justifies the model performance, and discusses potential applications and future improvements. Chapter 7 outlines the limitations of this study. Chapter 8 concludes the thesis by summarising the methodology and results.


2 Ultrasonic inspection of austenitic stainless steel welds

Non-destructive testing (NDT) is a field of analysis methods for testing a material or system without impairing its future usefulness. In NDT, the task is generally to detect flaws in the inspection targets so that necessary precautions can be taken in advance. Therefore, NDT can be beneficial for manufacturers in numerous ways: to ensure the reliability of the product, avoid failures, prevent fatal accidents and sustain the quality of production.

There are various inspection methods in NDT. The most common are visual inspection, liquid penetrant, magnetic particle, radiographical, eddy current and ultrasonic testing. In visual inspection, the test object is observed directly or with optical instruments to identify superficial irregularities. In liquid penetrant testing, a dye penetrant is applied to the test object to enhance the detection of surface discontinuities that may have been undetectable in a standard visual inspection. In magnetic particle testing, tiny ferromagnetic particles are deposited on the material and gather at the discontinuities when the test object is magnetised. Thus, magnetic particle testing is limited to ferromagnetic materials. Radiographical testing is based on the absorption of radiation in the test object, which can reveal material properties or discontinuities based on differences in absorption. Eddy current testing is based on electromagnetic induction, where an electric coil introduces an electromagnetic field to the test object. This field induces electric currents (eddy currents) in the object, which in turn generate a magnetic field that opposes the field of the coil. Defects are identified as changes in the impedance of the coil. Finally, ultrasonic testing (UT) is based on the excitation of a test object with ultrasonic waves that may interact with discontinuities within the material. (Workman & Kishoni, 2007)

Austenitic stainless steels are preferred for nuclear power plant components due to their high creep and corrosion resistance in combination with good weldability, all of which are necessary for the high-temperature environment (Bhaduri & Laha, 2015). Despite developments in welding methods, these components are still prone to process-induced and in-service defects. Therefore, NDT methods are valuable for verifying the structural integrity of these materials (Kumar, Menaka, & Venkatraman, 2019).

According to Chassingnole et al. (2010), radiographical techniques can cope with the high heterogeneity and anisotropy that govern austenitic stainless steel welds but are limited by location and geometry. Moreover, radiographical testing is better suited to manufacturing defects, while ultrasonic inspection suits service-induced flaws, such as thermal fatigue cracks. Thus, ultrasonic inspection can supplement radiographical inspection. Moreover, there have been developments in ultrasonic instrumentation, such as dual matrix phased arrays, that have shown improvement in SNR. Thus, they have increased the quality of the inspection throughout the process.

Still, the results of the ultrasonic inspection of complex materials, including the data acquisition and its evaluation, are highly dependent on the inspector’s performance. Therefore, the inspection is prone to human errors that can lead to fatal risks if large defects are missed, or to unnecessary costs due to false calls. Thus, it is necessary to validate the inspection methods to ensure their reliability in critical applications.

The basic concept of ultrasonic inspection is that ultrasonic waves are propagated in a test object, and the sound energy is disturbed when collisions, refractions, transmissions, or reflections occur. These effects may be caused by numerous factors, for example, by grain boundaries of an anisotropic material or other discontinuities such as flaws in the material.

2.1 Fundamental principles of ultrasonic inspection

Ultrasound consists of sound waves that are inaudible to the human ear due to their high frequency of 20 kHz or above. These sound waves have many applications in fields such as medicine and materials testing; the latter is of interest here, as discontinuities are detected from sound responses. (Krautkrämer & Krautkrämer, 1977)

In ultrasonic testing (UT) the inspection target is subjected to ultrasonic waves that can be divided into plane, spherical and surface waves. Plane waves are sound waves with the same phase across the corresponding plane, i.e. two arbitrary points on the plane have the same amplitude. In spherical waves, the wavefronts are spherical with constant amplitude on each wavefront around the source of excitation. Surface waves (Rayleigh waves) propagate along the surface of the material, with particles moving in an elliptical path in a plane perpendicular to the surface. (Workman & Kishoni, 2007) The most fundamental wave types for NDT are those with planar propagation, which can propagate in the bulk of the material and reveal internal defects.

Plane waves can be either longitudinal or transverse. These are the most fundamental wave modes in UT because their signals are easier to interpret. In a longitudinal wave (LW) the material is compressively excited, and the particles oscillate in the direction of wave propagation so that the energy transfers forward. In transverse waves (shear waves, SW), the particles oscillate perpendicular to the direction of propagation, creating waves that resemble sinusoidal patterns in the material. The velocities of the wave modes are linked to the material's Young’s modulus and shear modulus: the LW velocity is calculated from Young’s modulus, while the SW velocity is calculated from the shear modulus. Since the shear modulus is always less than Young’s modulus, the LW usually possesses a higher velocity in the material. These wave modes are visualised in Figure 1. Other wave modes are surface waves, which propagate similarly to LW and SW but only near the surface of the material. (Krautkrämer & Krautkrämer, 1977)
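The link between elastic moduli and wave velocities can be illustrated with a short calculation. The sketch below assumes an isotropic, homogeneous solid and uses the standard bulk-wave expressions, in which the LW velocity depends on Poisson's ratio in addition to Young's modulus. The function name and the material values (generic steel) are illustrative assumptions, not values from this thesis.

```python
import math


def bulk_wave_velocities(youngs_modulus_pa, poisson_ratio, density_kg_m3):
    """Return (c_lw, c_sw) in m/s for an isotropic elastic solid (assumed model)."""
    e, nu, rho = youngs_modulus_pa, poisson_ratio, density_kg_m3
    # Shear modulus follows from Young's modulus and Poisson's ratio.
    shear_modulus = e / (2.0 * (1.0 + nu))
    # Shear (transverse) wave velocity.
    c_sw = math.sqrt(shear_modulus / rho)
    # Longitudinal wave velocity in the bulk of the material.
    c_lw = math.sqrt(e * (1.0 - nu) / (rho * (1.0 + nu) * (1.0 - 2.0 * nu)))
    return c_lw, c_sw


# Generic steel values, assumed for illustration only.
c_lw, c_sw = bulk_wave_velocities(200e9, 0.30, 7850.0)
print(f"LW: {c_lw:.0f} m/s, SW: {c_sw:.0f} m/s, ratio: {c_lw / c_sw:.2f}")
```

With these assumed inputs the LW comes out nearly twice as fast as the SW, which is consistent with the statement that the LW usually has the higher velocity.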

These wave modes possess different sound velocities, which result in different refraction angles at interfaces. Therefore, in flaw detection, the signals of the two wave types can behave differently. LW has a higher velocity, which may extend to better penetration depth, while SW can maintain better sensitivity. Therefore, a clear strategy for the choice of wave type is necessary for the inspection process.

Figure 1 Acoustic waves. Blue particles indicate a longitudinal wave and orange particles represent a transverse wave. The small arrows on the particles indicate the direction of oscillation and the large ones show the direction of wave propagation.


Emitting just one wave type into a solid medium is usually not possible. Emitting LW at any angle other than perpendicular always results in a complementary SW because of mode conversion. In some cases, the interaction between these waves within the test object can cause interference in the signal; the SW echo appears later than the returned LW echo on the monitor because of its lower velocity. This may become a problem in the interpretation of results. Thus, the angle of incidence must be selected carefully prior to the inspection to ensure the intended direction of wave propagation.

2.2 Acoustic impedance and attenuation in flaw detection

Acoustic impedance

As discussed briefly in the previous section, ultrasonic waves are highly dependent on the reflectors within the test object. The interactions between the ultrasonic waves and the reflectors are the foundation of ultrasonic testing, since they enable the identification of flaws and the measurement of geometrical properties. This phenomenon is well described by a material property called acoustic impedance, which is the resistance of a medium to sound propagation.

Reflection and transmission are the proportions of the acoustic energy that are echoed backwards or transmitted forward, respectively, at the interface between different media. The ratio of these two can be deduced from the difference in the acoustic impedance of the materials. The formulas for reflected and transmitted waves yield

$$R = \frac{Z_2 - Z_1}{Z_2 + Z_1} \quad \text{and} \quad D = \frac{2 Z_2}{Z_2 + Z_1}, \qquad (2.1)$$

respectively, where Z is the acoustic impedance and the subscripts denote the two media: 1 is the initial medium and 2 is the second medium that the wave encounters. The higher the difference in acoustic impedance between the two materials, the higher the reflection of the wave. Most importantly, the variation of acoustic impedance makes the flaws within the material visible, since the acoustic energy is reflected, refracted, diffracted, or scattered at the boundaries between different impedances. These interactions are identified as a change in energy content or time of arrival relative to adjacent waves. (Krautkrämer & Krautkrämer, 1977)
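To give a feeling for the magnitudes involved, the following sketch evaluates the reflection and transmission factors for a wave travelling from steel into water or air. It assumes normal incidence and the textbook form R = (Z2 - Z1)/(Z2 + Z1), D = 2 Z2/(Z2 + Z1), which is commonly stated for sound pressure amplitudes. The impedance values are approximate handbook figures chosen for illustration, not values from this thesis.

```python
def pressure_reflection(z1, z2):
    """Reflection factor R for a wave going from medium 1 into medium 2."""
    return (z2 - z1) / (z2 + z1)


def pressure_transmission(z1, z2):
    """Transmission factor D for a wave going from medium 1 into medium 2."""
    return 2.0 * z2 / (z2 + z1)


# Approximate acoustic impedances in Pa*s/m (assumed, illustrative values).
Z_STEEL = 45.0e6
Z_WATER = 1.48e6
Z_AIR = 415.0

for name, z2 in (("water", Z_WATER), ("air", Z_AIR)):
    r = pressure_reflection(Z_STEEL, z2)
    d = pressure_transmission(Z_STEEL, z2)
    print(f"steel -> {name}: R = {r:+.5f}, D = {d:.5f}")
```

Under these assumptions an air-filled discontinuity in steel reflects almost all of the incident wave, which is why cracks act as strong reflectors. The negative sign of R indicates a phase reversal at the interface.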

The angle of incidence determines the angle of the wave transmitted into the material. The refracted angle in the second medium can be calculated from the sound velocities and the angle of incidence by using the law of refraction (Snell’s law), which yields

$$\frac{\sin \alpha_1}{C_1} = \frac{\sin \alpha_{LW}}{C_{LW}} = \frac{\sin \alpha_{SW}}{C_{SW}}, \qquad (2.2)$$

where $C_1$ and $\alpha_1$ denote the velocity and the angle of incidence in the first medium, respectively, $C_{LW}$ and $C_{SW}$ are the velocities of LW and SW in the material, and $\alpha_{LW}$ and $\alpha_{SW}$ are the corresponding propagation angles. The angle of incidence therefore has a strong influence in weld inspection, where angled beams are required since the weld bead prevents inspection from above. Angled beams can be accomplished, for example, by mounting a wedge on the probe or by using angled probes. (Krautkrämer & Krautkrämer, 1977)
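Snell's law can be evaluated numerically to see how one incident LW in a wedge splits into LW and SW at different angles in the test object. The sketch below is a minimal illustration; the function name and the velocities (a plastic wedge and generic steel) are assumptions chosen for the example, not values from this thesis.

```python
import math


def refracted_angle_deg(incident_deg, c_incident, c_refracted):
    """Refracted angle in degrees from Snell's law, or None beyond the critical angle."""
    s = math.sin(math.radians(incident_deg)) * c_refracted / c_incident
    if s > 1.0:
        # No propagating refracted wave of this mode exists.
        return None
    return math.degrees(math.asin(s))


# Assumed, illustrative sound velocities in m/s.
C_WEDGE_LW = 2330.0
C_STEEL_LW = 5900.0
C_STEEL_SW = 3230.0

for incident in (15.0, 30.0):
    lw = refracted_angle_deg(incident, C_WEDGE_LW, C_STEEL_LW)
    sw = refracted_angle_deg(incident, C_WEDGE_LW, C_STEEL_SW)
    print(f"incidence {incident:.0f} deg -> LW: {lw}, SW: {sw}")
```

With these assumed velocities, a 15 degree incidence produces both modes, with the LW refracted at a steeper angle than the SW, whereas at 30 degrees the LW no longer propagates into the steel and only the SW remains.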

An example of the previous concepts is presented in Figure 2, where the elements of a phased array probe are pulsed with different delay times that steer the ultrasonic beam towards the weld. In the lower left of the figure, the two different wave modes are shown within the weldment. The columnar grains of the weld dictate the direction of these waves. An angle of incidence other than perpendicular at the interface results in supplementary sound waves, due to mode conversion, that may weaken or corrupt the signal. The waves are also reflected at the interfaces; these reflections are not shown in the figure for clarity. In reality, wave scattering, reflection and refraction are immense in these welds. The monitor shows the received echo signal.

Figure 2 Schematic representation of ultrasonic inspection. The probe elements are pulsed with different time delays, which yields an oblique wavefront directed at the weld. The case is simplified here so that the base material is homogeneous and wave propagation in it is not disrupted. When propagating in the weld, the wave is refracted and reflected due to the heterogeneity that governs the weld. The sound wave is converted to both LW and SW due to mode conversion.
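The delay times that produce such an oblique wavefront can be sketched for the simplest case. The example below assumes a linear array coupled directly to a homogeneous medium (no wedge, no focusing), where element n is fired n * p * sin(theta) / c later than its neighbour, with p the element pitch, theta the steering angle and c the sound velocity. The function name and all numerical values are hypothetical and serve only as an illustration.

```python
import math


def steering_delays(num_elements, pitch_m, steer_deg, velocity_m_s):
    """Per-element firing delays in seconds for plane-wave steering (assumed model)."""
    dt = pitch_m * math.sin(math.radians(steer_deg)) / velocity_m_s
    delays = [n * dt for n in range(num_elements)]
    # Shift so that the earliest element fires at time zero.
    offset = min(delays)
    return [d - offset for d in delays]


# Hypothetical probe: 16 elements, 0.6 mm pitch, 20 degree steering in steel.
delays = steering_delays(16, 0.6e-3, 20.0, 5900.0)
print([f"{d * 1e9:.1f} ns" for d in delays[:4]])
```

A steering angle of zero gives identical delays for all elements, and reversing the sign of the angle reverses the firing order.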


Acoustic impedance in flaw detection

UT signals can be assessed by inspecting ultrasonic images constructed from the responses. Two of the most important imaging modes are amplitude scans (A-scans) and brightness amplitude scans (B-scans). In A-scans the reflectors can be observed as peaks in amplitude within the range of the inspection, or, if the A-scans are stacked to a two-dimensional representation (B-scan), as brightness indications. These images can be produced as a result of the acoustic impedance that varies within the test object. Figure 3 demonstrates an example of a B-scan of a weldment where the bright pixels inside the marked red ellipse indicate considerable acoustic variation originating from a flaw, while the contrast following this shows the weld and mode conversion.

A-scans are the elementary one-dimensional presentation of a signal, where one axis represents the elapsed time while the other axis shows the amplitude, which represents the intensity of the signal. The amplitude axis may reveal acoustic variations within the test object; thus, flaws can be identified. (Workman & Kishoni, 2007)

B-scans, on the other hand, are two-dimensional images of the cross-section of the test object, which make flaw localisation easier than A-scans, which only show single sound path lines. B-scans are constructed by stacking multiple adjacent A-scans, resulting in an image where the intensity of each pixel represents the amplitude. In this imaging technique, the vertical sweep is proportional to the depth while the horizontal sweep is proportional to the distance along the test object's surface, or vice versa. (Workman & Kishoni, 2007)
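The stacking of A-scans into a B-scan can be sketched in a few lines. The example below uses synthetic A-scans with a single Gaussian-shaped echo whose arrival time shifts with probe position; the helper names, the signal shape and the dimensions are assumptions made for illustration and do not describe the data used in this thesis.

```python
import math


def synthetic_a_scan(num_samples, echo_index, echo_amplitude=1.0, width=5.0):
    """One A-scan: amplitude versus time sample, with one Gaussian echo (assumed shape)."""
    return [
        echo_amplitude * math.exp(-0.5 * ((t - echo_index) / width) ** 2)
        for t in range(num_samples)
    ]


def build_b_scan(a_scans):
    """Stack A-scans side by side: rows are time (depth) samples, columns are positions."""
    return [list(row) for row in zip(*a_scans)]


# 30 probe positions; the echo arrives two samples later at each position.
a_scans = [synthetic_a_scan(200, echo_index=80 + 2 * k) for k in range(30)]
b_scan = build_b_scan(a_scans)
print(len(b_scan), "rows x", len(b_scan[0]), "columns")
```

Each column of the resulting image is one A-scan, so a reflector appears as a bright band whose vertical position follows the time of arrival across the scan positions.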

Acoustic impedance in austenitic stainless steel welds

The difficulty of inspecting stainless steel weldments stems from the structure’s heterogeneity and anisotropy. The varying acoustic impedance causes a considerable number of amplitude peaks (noise) in the responses, which may reduce the contrast of pixels in B-scans. The reduced contrast of flaw signals complicates flaw identification. Therefore, the inspection of these structures cannot be performed by flagging flaws that exceed a threshold value, as is typically done in automated systems. A low threshold would flag high noise peaks and result in an excessive number of false calls. With a high threshold, the noise peaks could be successfully ignored, but flaws that human inspectors can identify would be missed. The solution for identifying the true reflectors is to separate the characteristics of the flaw from the noise. To this end, state-of-the-art UT equipment can be employed that increases the signal-to-noise ratio (SNR), facilitates signal interpretation, and shortens inspection time. These features are discussed later in this chapter.
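The threshold dilemma described above can be made concrete with a toy example. The A-scan below is hand-made, with a few noise peaks that are slightly higher than the flaw echo; all amplitudes, indices and threshold values are invented for illustration and are not measurements.

```python
def flag_indications(a_scan, threshold):
    """Return the sample indices whose absolute amplitude reaches the threshold."""
    return [i for i, amp in enumerate(a_scan) if abs(amp) >= threshold]


# Hand-made A-scan: low background, three noise peaks and one weaker flaw echo.
a_scan = [0.05] * 20
noise_peaks = {3: 0.62, 9: 0.55, 16: 0.66}
flaw_index, flaw_amplitude = 12, 0.50
for index, amplitude in noise_peaks.items():
    a_scan[index] = amplitude
a_scan[flaw_index] = flaw_amplitude

low = flag_indications(a_scan, threshold=0.4)
high = flag_indications(a_scan, threshold=0.7)
false_calls = [i for i in low if i != flaw_index]
print("low threshold flags:", low, "false calls:", false_calls)
print("high threshold flags:", high, "flaw missed:", flaw_index not in high)
```

In this constructed case no single threshold separates the flaw from the noise: the low threshold finds the flaw together with three false calls, and the high threshold removes the false calls but misses the flaw.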

Figure 3 Ultrasonic B-scan. B-scan of stacked A-scans where the bright indication represents a flaw.

Attenuation

Attenuation is caused by scattering and absorption of the acoustic pressure during wave propagation in the material. Scattering consists of small reflections occurring due to discontinuities in the material structure, such as high anisotropy or large grain sizes, which cause variation in acoustic impedance. Attenuation results in a weakened response, due to acoustic energy that does not return, or in high noise peaks. If the grain size of the material is only a small fraction of the wavelength (1/1000 or 1/100), the scatter becomes negligible. However, regardless of the grain size, when the anisotropy of the material is higher, such as in austenitic stainless steels, the scattering often becomes more significant. This is because the acoustical properties of the material depend on location. Increasing the wavelength reduces the scattering of the ultrasonic wave but also reduces sensitivity. The second form of acoustic attenuation is absorption, that is, the transformation of sound energy to heat. (Krautkrämer & Krautkrämer, 1977)
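The grain-size criterion can be checked with a quick calculation of the wavelength, which equals the sound velocity divided by the frequency. The sketch below assumes a generic LW velocity for steel and two hypothetical grain sizes; only the 2.25 MHz frequency appears elsewhere in this chapter, and the remaining numbers are illustrative assumptions.

```python
def wavelength_mm(velocity_m_s, frequency_hz):
    """Wavelength in millimetres from sound velocity and frequency."""
    return velocity_m_s / frequency_hz * 1000.0


def grain_to_wavelength_ratio(grain_size_mm, velocity_m_s, frequency_hz):
    """Ratio of grain size to wavelength; small values suggest little grain scattering."""
    return grain_size_mm / wavelength_mm(velocity_m_s, frequency_hz)


C_STEEL_LW = 5900.0  # m/s, assumed generic value
FREQUENCY = 2.25e6   # Hz

print(f"wavelength: {wavelength_mm(C_STEEL_LW, FREQUENCY):.2f} mm")
for grain_mm in (0.02, 1.0):  # hypothetical fine and coarse grain sizes
    ratio = grain_to_wavelength_ratio(grain_mm, C_STEEL_LW, FREQUENCY)
    print(f"grain {grain_mm} mm -> grain/wavelength = {ratio:.3f}")
```

Under these assumptions the fine grain is below one hundredth of the wavelength, while the coarse grain is a substantial fraction of it, which is the regime where scattering becomes significant. Halving the frequency doubles the wavelength and halves the ratio, at the cost of sensitivity.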

According to Ploix et al. (2014), assessing the structural integrity of austenitic stainless steel welds with ultrasonic inspection is very difficult due to the governing anisotropy and heterogeneity, which result in high attenuation of the acoustic beam. The base materials of these structures are polycrystalline metals with various grain orientations, which cause wave scattering at grain boundaries as a result of the differences in acoustic impedance. Ploix et al. measured the attenuation of these weldments and concluded that numerous parameters contribute to the loss of acoustic energy, such as the probe aperture, mode conversions and beam deviations. Their measurements showed attenuation of approximately 0.06 dB/mm for a grain orientation of 35° and 0.13 dB/mm for 45° with a 2.25 MHz probe.

Furthermore, the columnar grains of these weldments, oriented according to the solidification process, cause further beam splitting and distortion when their boundaries are encountered. This is because the acoustical properties change from one columnar grain to the next. (Kolkoori, Rahman, & Prager, 2012)

Attenuation in flaw detection

In A-scans, attenuation can be observed as a dampened amplitude of the response, which signifies that the received signal has lost energy due to beam scattering, distortion, and absorption. Therefore, higher attenuation is also observed for longer beam paths. In B-scans, the effect of attenuation can be observed as a dimming of reflectors, which blend into the neighbouring pixels and thus reduce the contrast of the prominent signal. This increases the likelihood of a miss. To facilitate inspection, more sophisticated phased array ultrasonic equipment may be employed. Such multi-transducer systems capture numerous transmit-receive combinations, providing more data on the inspected volume from multiple angles. Thus, high attenuation along one angle may be compensated by data from adjacent angles. Phased array systems are discussed further in the subsequent sections.

2.3 Introduction to phased array ultrasonic inspection

The changing acoustical properties within the structure make flaws visible in the bulk of the inspection target. To emit ultrasonic waves into the test object and analyse the response, sophisticated ultrasonic systems are required. Such a system includes probes, wedges, and flaw detectors (monitors). The probes and wedges are often application-specific, with numerous properties that all affect wave propagation.

Several ultrasonic instruments are suitable for weld inspection, such as conventional and phased array probes. The conventional probe (with a single piezoelectric element) is often used with a fixed-angle wedge. It initially emits LW within the wedge, which is converted to LW and SW during transmission into the test object; this is referred to as mode conversion.


Mode conversion also occurs at interfaces within the test object, where the wave is refracted or reflected. (Workman & Kishoni, 2007)

According to SFS-EN ISO 22825:2017, a standard for testing welds in austenitic steels and nickel alloys, if LW is directed into the test object below the critical angle, both LW and SW are generated at the surface, which may interfere with the signal. Therefore, the angle of incidence can be selected to be above the first critical angle to reduce disturbance. The first critical angle is defined as the angle above which only SW passes below the surface. The second critical angle is the angle above which no bulk waves are emitted into the test object. These principles apply to both conventional and phased array probes.
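The two critical angles follow from Snell's law, since a refracted wave travels along the interface when its refraction angle reaches 90°. The sketch below computes both angles. The velocities (a polymer wedge at 2330 m/s, steel LW at 5900 m/s and SW at 3230 m/s) are assumed round values for an isotropic material, not values from the standard or from this work, and the function name is hypothetical.

```python
import math


def critical_angle_deg(c_incident, c_refracted):
    """Incidence angle (degrees) at which the refracted wave runs along the interface.

    Follows from Snell's law, sin(a1)/c1 = sin(a2)/c2, with a2 = 90 degrees.
    Returns None when no critical angle exists (refracted wave not faster).
    """
    if c_refracted <= c_incident:
        return None
    return math.degrees(math.asin(c_incident / c_refracted))


if __name__ == "__main__":
    # Assumed, illustrative sound velocities in m/s.
    c_wedge_lw = 2330.0
    c_steel_lw = 5900.0
    c_steel_sw = 3230.0

    first = critical_angle_deg(c_wedge_lw, c_steel_lw)   # refracted LW vanishes above this
    second = critical_angle_deg(c_wedge_lw, c_steel_sw)  # refracted SW vanishes above this
    print(f"first critical angle:  {first:.1f} deg")
    print(f"second critical angle: {second:.1f} deg")
```

With these assumed velocities, the first critical angle is roughly 23° and the second roughly 46°, so wedge angles between the two would transmit only SW into the bulk. In an anisotropic weld, the velocities depend on direction, so these figures are only indicative.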

Compared to conventional angled probes, phased array (PA) probes have multiple transducer elements that can be excited at different times, enabling control over characteristics of the ultrasonic beam, such as beam steering and electronic focusing, without repositioning the probe. Electronic focusing shapes the beam towards an expected flaw location, which increases the SNR at the focal point. Additionally, a PA probe can inspect a larger area simultaneously because it has more elements. These properties are achieved by applying focal laws that adjust the pulsing delay of each transducer element.

Phased array ultrasonic testing (PAUT) of austenitic stainless steels can provide a more interpretable signal due to the increased SNR from electronic focusing. Furthermore, the scanning process can be accelerated by shorter scan lines. However, weldments introduce a considerable amount of discontinuity, so ideal results may require more advanced systems to increase the SNR, such as dual-matrix arrays with separate arrays for transmitting and receiving.

The essential properties of a phased array probe are the probe aperture, number of transducer elements, pitch, and frequency. The physical components are visualised in Figure 4. The aperture is the distance between the outermost elements and thus determines the physical size of the probe. Increasing the aperture increases the transition distance and focal depth. (Workman & Kishoni, 2007) Pitch is the separation distance between adjacent elements. Frequency affects the transmission depth and the sensitivity of the ultrasonic beam, as discussed in the context of attenuation in the previous section.

Figure 4 Simplified phased array probe. All eight transducer elements are excited simultaneously.


The maximum focal length is determined by the near field, which represents the distance over which dynamic focusing can be conducted. The near field length can be approximated with the formula

$$N = \frac{D^2}{4\lambda} \qquad (2.3)$$

where $D$ is the diameter of the probe, or the aperture of a phased array probe, and $\lambda$ is the wavelength (Krautkrämer & Krautkrämer, 1977). Since the wavelength is inversely proportional to the frequency, it can be deduced that a higher frequency increases the near field, and thus the distance of dynamic focusing. Narrower ultrasonic beams in the near field can be achieved with curved transducer elements, which increase the focal zone in the near field but result in shallower penetration depths. The transition distance can be increased by utilising a probe with a larger aperture. (Workman & Kishoni, 2007)
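The dependence of the near field on aperture and frequency can be illustrated with a short calculation, using the standard relation between wavelength, sound velocity and frequency. The 16 mm aperture and the 5900 m/s LW velocity are assumed example values, and the function name is hypothetical; only the 2.25 MHz frequency appears elsewhere in the text.

```python
def near_field_length_mm(aperture_mm, frequency_hz, velocity_m_s):
    """Near field length N = D^2 / (4 * wavelength), with wavelength = c / f."""
    wavelength_mm = velocity_m_s / frequency_hz * 1000.0
    return aperture_mm ** 2 / (4.0 * wavelength_mm)


if __name__ == "__main__":
    aperture_mm = 16.0     # assumed aperture
    velocity_m_s = 5900.0  # assumed LW velocity in steel
    for frequency_hz in (2.25e6, 5.0e6):
        n = near_field_length_mm(aperture_mm, frequency_hz, velocity_m_s)
        print(f"f = {frequency_hz / 1e6:.2f} MHz: N = {n:.1f} mm")
```

Doubling the frequency doubles the near field length, while doubling the aperture quadruples it, which is why the aperture is the more effective parameter for extending the focusing range.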

As discussed earlier, the benefit of a PA probe is the electronic steering and focusing of the sound beam. These properties stem from Huygens' principle, where each point of the wavefront emits so-called elementary waves (spherical waves), here initiated from each transducer element (Krautkrämer & Krautkrämer, 1977). The envelope of these elementary waves constructs the new wavefront, as seen in Figure 5. Therefore, applying focal laws can steer and focus the beam. (Workman & Kishoni, 2007) The steering of a PA probe is adjusted by introducing various focal laws, which are generally integrated into the phased array instrument. For example, by introducing more focal laws, higher precision can be achieved in sectorial scans since the sweep angle step is smaller (Gommlich & Schubert, 2016).
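A focal law for pure beam steering can be sketched as a linear delay ramp across the elements, so that the elementary waves add up to a tilted plane wavefront. The sketch below assumes a linear array in direct contact with a homogeneous, isotropic medium (no wedge, no focusing); the element count, pitch and velocity are example values, and the function name is hypothetical rather than part of any instrument's interface.

```python
import math


def steering_delays_s(num_elements, pitch_m, angle_deg, velocity_m_s):
    """Firing delay of each element (seconds) for steering a plane wave.

    Neighbouring elements differ by pitch * sin(angle) / velocity. The delays
    are shifted so that the earliest element fires at time zero.
    """
    step = pitch_m * math.sin(math.radians(angle_deg)) / velocity_m_s
    delays = [n * step for n in range(num_elements)]
    earliest = min(delays)
    return [d - earliest for d in delays]


if __name__ == "__main__":
    # Assumed example probe: 8 elements, 0.6 mm pitch, LW in steel at 5900 m/s.
    for angle in (0.0, 30.0, 45.0):
        delays = steering_delays_s(8, 0.6e-3, angle, 5900.0)
        print(f"{angle:4.1f} deg:", [f"{d * 1e9:.0f} ns" for d in delays])
```

A sectorial scan corresponds to a list of such delay sets, one per sweep angle, so a finer angle step directly means more focal laws. Focusing would add a curved delay profile on top of the linear ramp, which is omitted here.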

PAUT has properties that facilitate the inspection of highly attenuative heterogeneous weldments. Electronic focusing increases the SNR locally, which may lift a reflector out of an otherwise noisy background. Therefore, PA is useful for austenitic stainless steel weldments. Moreover, techniques for efficient data acquisition have become more available, enabling the storage of every transmit-receive combination. This method captures each focal point, resulting in better signal interpretation and SNR at every location. The downside is the large storage requirement for the high number of A-scans.

Figure 5 Simplified example of a steered ultrasonic wave. The elements are excited using focal laws to obtain a skewed wavefront.


Effect of the angle of incidence

Satyanarayan et al. (2009) tested different inspection angles with PAUT, where the angle was set to 35°, 45° and 60°. Each angle was tested both focused and unfocused. The test object was a 25 mm thick carbon steel component with a 10 mm bottom crack perpendicular to the surface. The highest crack tip diffraction amplitude was obtained with the 60° inspection angle, because the wavefront had formed closer to the transducer. In the 35° inspection, the crack was at a closer distance; however, the wavefront was not fully developed, resulting in a weak signal at the crack tip. Conversely, at the surface tip or corner trap, the amplitude was higher for the lower angle, since the wavefront had developed over the longer travel, while at the 60° inspection angle the beam had diverged and lost energy, resulting in a weakened signal. Therefore, the optimal angle was 45°, since it could illuminate both the crack tip and the corner trap at an acceptable amplitude.

The experiment by Satyanarayan et al. (2009) demonstrates that the choice of angle has a significant impact on the detection of the flaw. This should be emphasised in weld inspections: if a large angle were used to inspect flaws at a far distance, the flaws could become undetectable due to the long travel of the beam, especially in austenitic welds with high attenuation. Moreover, if the crack were close and inspected with lower angles, interference in the near field could make the flaw difficult to distinguish. To reduce near field interference, advanced probes that consist of separate transmitting and receiving elements can be utilised.
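The effect of the angle on the beam travel can be made concrete with simple geometry. The sketch below computes the one-way sound path to the back wall of a 25 mm thick plate for the three angles discussed above, assuming a straight ray in a homogeneous material and an angle measured from the surface normal. These simplifications and the function name are assumptions for illustration; the sketch does not reproduce the measurements of Satyanarayan et al.

```python
import math


def sound_path_to_depth_mm(depth_mm, angle_deg):
    """One-way straight-ray path length to a given depth for a refracted angle
    measured from the surface normal (homogeneous, isotropic material assumed)."""
    return depth_mm / math.cos(math.radians(angle_deg))


if __name__ == "__main__":
    thickness_mm = 25.0  # plate thickness quoted in the text
    for angle in (35.0, 45.0, 60.0):
        path = sound_path_to_depth_mm(thickness_mm, angle)
        print(f"{angle:.0f} deg: {path:.1f} mm to the back wall")
```

The path at 60° is about 1.6 times the path at 35°, so in a highly attenuative weld the larger angle pays a correspondingly larger attenuation penalty for flaws near the back wall.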

2.4 Advanced phased array ultrasonic inspection

A more advanced PA probe, called the dual matrix array (DMA) probe, may facilitate the inspection of austenitic weldments. These probes possess separate matrix arrays for transmitting and receiving the ultrasonic wave. Compared to the probes described earlier, the separate receiving elements of the DMA probe can detect signals before the end of the transmitter oscillation. Therefore, blind zones can be eliminated without saturating the receiving elements. Two types of DMA wave modes can be emitted: transmit-receive longitudinal (TRL) and transmit-receive shear (TRS), which can be optimised by using specific wedges. For optimal TRS propagation, the waves are emitted so that only this wave type persists. TRL is optimised so that the longitudinal and transversal waves maintain minimal interaction. Another advantage of a DMA probe is that the arrays are pitched at an angle (roof angle) so that the transmitted and received beams overlap, causing a so-called pseudo focusing effect. This further improves the SNR, increasing the compatibility of the probe with acoustically noisy materials. (Workman & Kishoni, 2007)

In addition to the acoustical properties of a DMA probe, its two-dimensional structure enables similar properties in the width direction, giving more precision in focusing and a larger scan area. This implies that more data is collected along a similar scan line than with a linear array probe, which can improve the reliability of inspection as a result of larger data sets.

2.5 Ultrasonic inspection of austenitic stainless steel welds

The anisotropic microstructure of the material and the heterogeneity of the weld increase the attenuation and uncertainty of the ultrasonic inspection due to the different acoustic properties within the materials. The varying acoustic impedances cause the velocity to become both direction and location dependent. Furthermore, the high scattering within the weldments reduces the SNR, causing noise peaks in the signals, which further complicate defect localisation. An example of scattering can be seen in Figure 6. In these cases, modelling can be beneficial,


allowing simulation of ultrasonic wave propagation through the material, although it can be time-consuming and impractical due to the strongly varying acoustic properties. Nevertheless, in practice, modelling could provide quantitative and realistic descriptions of the weldments. (Chassignole, El Guerjouma, Ploix, & Fouquet, 2010)

Austenitic stainless steel welds consist of columnar grains that grow parallel to the direction of the heat flow and do not destroy the columnar structure of the previous bead. Furthermore, the solidification of these welds governs the growth direction, so that the crystal orientation is determined by the underlying crystals. These columnar grains make up the multipass welds, which further complicates wave propagation. (Chassignole et al., 2010)

Chassignole et al. (2010) also observed that the dendrite axis was nearly perpendicular to the fusion lines and the weld surface in two AISI 316L steel samples welded with automatic arc welding. The two samples had different weaving rates, which resulted in different bead string widths. At higher oscillation rates, smaller angles of the fibre axis relative to the bead normal were observed. Ultrasonic velocity measurements showed that the sample welded at the higher rate maintained higher velocities, which was attributed to its more homogeneous structure resulting from greater grain growth compared with the other sample. This implies that the weld bead has an impact on the behaviour of the ultrasonic beam and that different sections of a weld can show significant variations.

Overcoming problems of high attenuation

The challenges of inspecting austenitic stainless steel welds can be alleviated with the state-of-the-art UT equipment described in the previous sections. Examples are phased array probes, which enable the capture of a considerable amount of data, and DMA probes, which further increase the SNR with pseudo focusing and eliminate blind zones.

In a study by Kumar et al. (2019), the performance of a 16x2 DMA probe (tested only with LW in the material) was compared to a 64-element LA probe (tested with LW and SW) on two austenitic stainless steel welds, one with two side-drilled holes and the other with three notches at the weld root. The experiments showed a significant improvement in SNR and depth analysis with the DMA probe, which was due to the pseudo focusing effect, implying that the DMA probe was more appropriate for the inspection of austenitic stainless steel welds.

Figure 6 Angled probe emitting both LW and SW into an anisotropic material. The ultrasonic wave scatters at some interfaces, which also causes mode conversions.


2.6 Analysing ultrasonic signals

Previous sections have briefly discussed amplitude scans (A-scans) and brightness scans (B-scans), which are standard UT imaging techniques applied during inspection. The signals are obtained by firing and recording with delay times and are subsequently digitised into A-scans, which are then transferred to the monitor. The A-scans can also be stacked to form two-dimensional representations on the monitor. For more accurate imaging, methods such as full matrix capture (FMC) can be utilised. FMC fires every element separately and collects the signal with all the receiving elements, recording an A-scan for every transmitter-receiver combination. Therefore, FMC stores significantly more information for the post-processing of the ultrasonic data. This results in richer data and probably in more efficient signal interpretation. The downside of FMC is the high number of stored A-scans, which increases the computational requirements. (Tremblay & Richard, 2012)
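The storage cost of FMC can be estimated directly from the number of transmitter-receiver combinations. The sketch below counts the A-scans per probe position and converts them to bytes. The element count, the number of samples per A-scan and the sample size are assumed example values, and the function names are hypothetical.

```python
def fmc_ascan_count(num_transmitters, num_receivers):
    """Number of A-scans in one FMC frame: one per transmitter-receiver pair."""
    return num_transmitters * num_receivers


def fmc_frame_bytes(num_transmitters, num_receivers, samples_per_ascan, bytes_per_sample):
    """Raw storage needed for one FMC frame (one probe position)."""
    return (fmc_ascan_count(num_transmitters, num_receivers)
            * samples_per_ascan * bytes_per_sample)


if __name__ == "__main__":
    # Assumed example: 64-element array used for both transmit and receive,
    # 2000 samples per A-scan, 16-bit samples.
    count = fmc_ascan_count(64, 64)
    size = fmc_frame_bytes(64, 64, 2000, 2)
    print(f"{count} A-scans, {size / 1e6:.1f} MB per probe position")
```

Because the count grows with the product of transmitting and receiving elements, doubling the element count quadruples the data per position, and a full scan line multiplies this again by the number of positions.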

Discussion of imaging

These imaging techniques can all be utilised concurrently if the testing equipment supports them. This increases reliability, since the test object can be observed from multiple angles. Furthermore, the scans show that a phased array probe is faster in scanning a test object, as it covers a larger scanning area simultaneously, while a conventional probe requires additional repositioning to capture the same information. Additionally, FMC was reviewed; it captures more data from the scan location than standard data capture techniques. FMC is therefore more valuable for post-processing and, later, for the machine learning model.

2.7 Evaluating the performance of ultrasonic inspections

Ultrasonic scans of critical components are very demanding to interpret and are prone to errors caused by the human inspector or the equipment. Thus, methods to verify inspectors' capabilities are required to ensure the quality of the inspection. UT capability can be assessed with the hit/miss POD analysis of MIL-HDBK-1823A (Annis, 2009). This method has been widely utilised to quantify performance. For example, Kurtz et al. (2013) assessed POD for three teams on PAUT data from different steel specimens containing cracks. Virkkunen et al. (2019) used hit/miss analysis to evaluate and compare the performance of human inspectors and a machine learning model on ultrasonic scans from austenitic stainless steel welds.

POD can be estimated for an inspector by supplying a data set containing samples both with and without defects. The task is to identify the flaws in each sample, resulting in binary responses of hit or miss. A hit signifies an affirmative response when a flaw is present, while a miss signifies a negative response when a flaw is present. The binary responses can then be modelled via maximum likelihood regression with a generalised linear model (GLM) using a link function such as logit, probit, cloglog or loglog, which relates the crack depth to the probability of detection (Annis, 2009). According to ASTM E3023-15 (2015), the data set for ASTM POD requires a minimum of 40 cracks to capture the natural variance of different flaw signals. Moreover, according to ASTM E2862-18 (2018), the POD curve cannot be computed if all flaws are correctly classified or all are missed. Therefore, computing ASTM POD can be difficult due to the low number of flawed samples, which is typical in many applications.
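The hit/miss regression can be sketched as a two-parameter logistic model fitted by maximum likelihood. The code below is a minimal illustration using the logit link and Newton-Raphson iterations; it is not the MIL-HDBK-1823A procedure or software, it computes no confidence bounds, and the twelve hit/miss observations are invented for the example (far fewer than the 40 cracks mentioned above).

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logit_pod(sizes, hits, iterations=50):
    """Maximum likelihood fit of POD(a) = sigmoid(b0 + b1 * a) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for a, y in zip(sizes, hits):
            p = sigmoid(b0 + b1 * a)
            w = p * (1.0 - p)
            g0 += y - p              # gradient of the log-likelihood
            g1 += (y - p) * a
            h00 += w                 # negative Hessian (information matrix)
            h01 += w * a
            h11 += w * a * a
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break  # degenerate data, e.g. all hits or all misses
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


def flaw_size_at_pod(b0, b1, pod):
    """Flaw size at which the fitted curve reaches the given POD (point estimate)."""
    return (math.log(pod / (1.0 - pod)) - b0) / b1


if __name__ == "__main__":
    # Invented hit/miss data: crack depth in mm and 1 = hit, 0 = miss.
    sizes = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
    hits = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1]
    b0, b1 = fit_logit_pod(sizes, hits)
    print(f"a50 = {flaw_size_at_pod(b0, b1, 0.5):.2f} mm, "
          f"a90 = {flaw_size_at_pod(b0, b1, 0.9):.2f} mm (no confidence bound)")
```

The guard on the determinant only catches fully degenerate data. When hits and misses are perfectly separated by size, the likelihood has no finite maximum, which mirrors the condition in ASTM E2862-18 that a POD curve cannot be computed when all flaws are found or all are missed.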


POD can provide a quantitative estimate of performance. The quantity $a_{90/95}$ characterises the long-run frequency of a crack being correctly classified: it is the flaw size that is detected with 90% probability at a confidence bound of 95% (Annis, 2009). Therefore, if significant variance in hit/miss is observed near the 90% probability, a wider confidence bound is expected, which results in a higher $a_{90/95}$; this is not desirable. Furthermore, if the desired detectable flaw size is smaller than the estimated $a_{90/95}$, the inspection cannot be considered reliable.

2.8 Ultrasonic testing in the scope of machine learning

The scans produced during inspection are used for training the model. To train a model with machine learning, the following criteria must be met: the data set is relevant to the task, the data set is sufficiently large, the data contains sufficient variance, and the images are labelled correctly. Without the information covered in this chapter, training a model for the given task can become tedious, and the high accuracy required of NDT systems can become unachievable.

A common problem in the inspection of complex structures is human error. Still, in many situations, UT depends on the human inspector, implying that the quality may depend on the task, fatigue, stress, or even personal life. Furthermore, poorly established or poorly written inspection procedures and aims may also complicate detection. (Ali et al., 2012) These factors can result in a significant flaw being missed for reasons unrelated to the inspector's qualifications. Moreover, the trend in UT is towards richer data sets, with phased array systems used to store as much data as possible. For the human inspector, this makes data processing more tedious. To reduce the burden on the human inspector and to take advantage of the data acquisition of phased array systems, computational solutions should be considered. Such computational approaches have previously been limited by the low number of samples available for capturing relevant features. Computational models can therefore benefit fully from state-of-the-art equipment that records considerably more data than earlier techniques.


3 Machine learning and image classification

Ultrasonic testing has proven itself reliable in numerous applications that require inspection of a component's structural integrity. Modern ultrasonic inspections of highly noisy materials are conducted either mechanised or manually. In the latter case, the inspector is responsible for maintaining the consistency of the scanning process while simultaneously observing the response signal of the ultrasonic wave. This may result in critical human errors during the inspection. Furthermore, because real defects are rarely found in the inspection target, the inspector may miss flaws that would usually be found, even flaws of critical size.

Many applications lean towards better data acquisition to improve precision and reliability. An example is state-of-the-art phased array ultrasonic equipment, which can capture plenty of information from the same inspection location. This provides maximal information on the state of the component but increases the burden on the human inspector and the risk of errors because of the multiplied data sizes. Consequently, it is often sensible for the inspector to examine only specific regions of interest. Therefore, humans do not take full advantage of the improved technologies.

Artificial intelligence (AI) has been a booming field in research and practical applications in recent years. In the early stages of the field, AI was able to surpass humans in games such as chess, where the game itself can be described by a list of formal rules. These rules could be hardcoded into the AI, enabling it to carry out formal actions based on the opponent's moves. Such formal and abstract tasks can be challenging for humans but are relatively simple for AI. Only recently has AI started to reach human-level performance in tasks such as speech and image recognition, which require knowledge and intuitive approaches. These tasks are therefore difficult to express formally to the AI. Many challenges have emerged in tasks solved with hardcoded features, suggesting that the AI should acquire this knowledge itself by obtaining information from the data. This gave rise to machine learning (ML), which has facilitated solving real-world tasks that may require intuitive and subjective approaches. (Goodfellow, Bengio, & Courville, 2016)

Large-scale data sets are the key enabler of ML. ML models learn from statistical data and make predictions on new data based on experience. Hence, richer information can result in more powerful models, depending on the quality of the data. Thus, state-of-the-art data acquisition systems can be employed more efficiently by making use of more data. ML has enabled high-level performance without relying excessively on methods that cause information loss, such as feature extraction. In addition to the availability of large-scale data sets, the continuous development of and access to computational resources have enabled the use of heavy ML algorithms. Therefore, better-performing estimators can be developed with less effort, which has increased the demand for ML in more fields of research. This makes ML attractive in the field of NDT, where reliability requirements are very high.

3.1 Machine learning models

Machine learning (ML) is a subfield of artificial intelligence (AI), where data set problems are solved with simplified mathematical approximations. ML has enabled algorithms that can produce rules by themselves instead of relying on manually tailored rules, which are typical of traditional data analysis. Therefore, fewer hardcoded rules and feature


extractions are required in the data pipeline from input to output. Deep learning (DL), in turn, is a subfield of ML where more demanding tasks are solved with more complex and computationally heavier mathematical representations. In some cases, simpler ML models remain interpretable for the practitioner, whereas the complexity of a DL model makes it incomprehensible.

ML, in general, can be divided into three distinct categories based on the model's learning method. The first is supervised learning, where the correct target of each data point or sample is known. For example, a single ultrasonic scan has a binary label indicating whether a flaw exists in the scan. The second form of ML is reinforcement learning, where only feedback on actions is provided, such as in a parking simulation with a car trying to reverse into a parking space. Here, the correct answer is not provided, but guidance for the subsequent prediction is given, such as "steer right". Finally, the last form is unsupervised learning, where the target is unknown and no guidance is provided, so that the model attempts to compute similarities based on the inputs. A typical example is Fisher's (1936) iris data set, containing manually collected measurements of iris flowers. The task is to compute the relationships between the flower features (length and width of the petal and sepal) and determine their species.

Within each learning category, there are various types of methods, visualised in Figure 7. Supervised learning includes Bayesian models, decision trees, neural networks, support vector machines, ensemble or gradient boosting models, and deep neural networks. A Bayesian model is based on Bayes' rule, where an unknown variable that maps the event can be computed from prior distributions and the likelihood of that event occurring. This can be utilised, for example, to compute the parameters of a linear regression model given a labelled data set. Decision tree models are trained by constructing internal nodes in a hierarchical graph, which can be viewed as questions posed to the input. The classification or numerical output is estimated based on how the input responds to these questions. Support vector machines (SVM) predict the classes of a data set with a threshold that is computed from the data in a higher-dimensional space. There are various methods for computing the threshold, such as the radial basis function (RBF), to give the SVM sufficient complexity for the problem. Many ensemble and gradient boosting models utilise an ensemble of models, allowing each model to vote on the output and thus averaging the outputs of the models. (Goodfellow et al., 2016)

Figure 7 General machine learning algorithms, grouped by learning category.


Neural network models attempt to map the input to the output through multidimensional representations, combining numerous linear and non-linear operations into a complex mathematical function. The neural networks in Figure 7 represent shallower networks consisting of only one or a few layers, which limits their capabilities. Deep neural networks belong to deep learning. The same principles apply as in shallower neural networks, but deep neural networks have greater complexity through a larger number of hidden layers.

There are also various learning methods within reinforcement learning, such as Q-learning, Markov decision processes and deep Q-learning, of which the last belongs to the deep learning category. According to Chollet (2018), in many reinforcement learning methods, an agent receives information about the state of its environment and learns policies that select actions maximising the reward, as in the example of the self-parking car.

Finally, unsupervised learning can solve clustering problems where no labels or guidance are available. Common tools in the unsupervised category are K-means clustering and principal component analysis. K-means clustering attempts to group data points based on the similarity of their features, as discussed in the example of the iris flowers. Principal component analysis is a dimensionality reduction method that preserves as much information as possible. This is performed by an orthogonal projection of the input, followed by removal of the dimensions with the least variance (Goodfellow et al., 2016). Thus, the data becomes more interpretable, so that measuring correlation is easier.

Deep neural networks are of interest in an image classification task, and therefore other models are not discussed further here. A more comprehensive description of various supervised and unsupervised methods can be found in the book by Goodfellow et al. (2016).

3.2 Forward propagation

Deep learning (DL) is a subfield of ML, where the main difference is the automated learning process conducted within the numerous hidden layers that make up the deep neural network. The outputs of the hidden layers have no directly interpretable meaning, since the functions of the network at deeper levels are often incomprehensible to the practitioner. Technically, DL is described as a multistage process of learning features of the data set. The learning is based on the probabilities computed with feedforward networks containing the mathematical operations that are utilised to reparametrise the weights within the layers.

Feedforward networks are the fundamental basis of neural networks, where the aim is to map the input to the correct output using linear and non-linear functions. As an example, let $x$, $y$ and $f$ be the input, output and the function (or functions), respectively. The mapping then reads $y = f(x)$. In addition, some parameters are optimised during learning so that the mapping yields the most viable results. The formula then becomes $y = f(x, \theta)$, where $\theta$ represents the learnable parameters. The function $f$ can be a composition of many functions, whose number in deep learning represents the depth of the network. It should be noted that this is only a simplified representation, where the values usually represent multidimensional tensors. (Goodfellow et al., 2016)

It was said that $f$ can include linear and non-linear functions. This is often the case, since if the output were represented exclusively with linear functions, the model as a whole would be a linear function, which is usually inadequate for mapping demanding multidimensional data. This


would result in a model corresponding to linear regression. Therefore, in many tasks, it is crucial to adopt non-linear functions called activations. A common activation function is the rectified linear unit (ReLU), which only passes the positive part of its argument, yielding $g(x) = \max(0, x)$. Another fundamental activation function is the sigmoid function. This function follows

$$\sigma(x) = \frac{1}{1 + \exp(-x)} \qquad (3.1)$$

so that the input is squashed to values between zero and one (Goodfellow et al., 2016). Therefore, it is a commonly used activation for the final layer in binary classification tasks, such as flaw detection. There are numerous other activations fitting different tasks; these are outside the scope of this study and are therefore not covered here.
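The forward pass described above can be sketched with a tiny network: one hidden layer with ReLU activation followed by a single sigmoid output, as would suit a binary flaw/no-flaw decision. The layer sizes, weight values and function names are arbitrary assumptions for illustration; a real image classifier would operate on multidimensional tensors and have far more parameters.

```python
import math


def relu(values):
    """Rectified linear unit applied element-wise: g(x) = max(0, x)."""
    return [max(0.0, v) for v in values]


def sigmoid(x):
    """Sigmoid activation, equation (3.1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(weights, biases, inputs):
    """Linear layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, params):
    """y = f(x, theta): linear -> ReLU -> linear -> sigmoid."""
    hidden = relu(dense(params["W1"], params["b1"], inputs))
    logit = dense(params["W2"], params["b2"], hidden)[0]
    return sigmoid(logit)


if __name__ == "__main__":
    # Arbitrary example parameters (theta), chosen by hand.
    params = {
        "W1": [[1.0, -1.0], [0.5, 0.5]],
        "b1": [0.0, 0.0],
        "W2": [[1.0, -1.0]],
        "b2": [0.0],
    }
    print(f"predicted flaw probability: {forward([2.0, 1.0], params):.3f}")
```

Without the ReLU between the two linear layers, the whole network would collapse into a single linear map followed by the sigmoid, which is the limitation the text attributes to purely linear functions.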

The forward pass (see Figure 8) computes the output, after which its difference from the true target (labelled data) is measured. This measurement is used in backpropagation, which is discussed in the following section.

3.3 Backward propagation

Optimisers and objective functions are used in the backpropagation of feedforward neural networks, where a new set of weights is determined. The objective function measures the difference between the model's predictions and the true targets (in the case of supervised learning). The dissimilarity is then passed via gradient operations to the optimisation algorithm, which updates the function parameters. The objective function is usually called the loss or cost function when it is to be minimised (Goodfellow et al., 2016).

Suppose that we want to train a model that predicts whether a flaw is present in input images. This is achieved by choosing a number of parameters and functions that can map the input towards the best outcome. When the network is presented with the training data set, the difference between the response (mapped with the forward pass) and the true target is computed. This difference is represented with a continuous loss function to enable differentiation. Subsequently, the gradients of each layer's functions can be computed. Based on these, the parameters are updated via optimisation to achieve better estimates on the next minibatch. These batches are predefined sets of samples grouped from the complete data set to enable batch-wise parameter updating instead of using one single sample for every update. Because these operations are automated and repeated multiple times, the network can be considered to learn by itself from previously seen data.

The example just described utilises supervised learning, where the correct labels of the samples are provided. This enables the loss to be computed based on the entropy of the predictions and true targets using the loss function. Subsequently, the model’s weights and biases are updated regularly according to these values in pre-established batch sizes, utilising gradient-based optimisers and loss functions that are selected based on the task. The criterion for conducting gradient-based operations is that the neural network must be differentiable. These gradient operations are based on stochastic gradient descent (SGD), where the parameters are modified in small increments, in the opposite direction of the gradients, to achieve higher accuracy and smaller loss on the next iteration. Small updates are used to solve the parameters since analytical solutions are infeasible due to the high number of weights and dimensions. (Chollet, 2018)

The problem with SGD is that it can oscillate strongly in unwanted directions, which results in slower or non-converging iteration steps. Therefore, various optimisers have been developed that stem from SGD, with alterations to improve the performance further. One of the most utilised is RMSprop, proposed by Geoffrey Hinton (2012), where the gradient is divided by the root of an exponential moving average of the squared gradients. This dampens the gradient in the irrelevant direction and accelerates the updates in the relevant direction, so that the optimisation converges more rapidly towards the minimum. The weights are updated in RMSprop with formula (3.2)

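$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\delta + r}} \odot g \qquad (3.2)$$

where $\theta$ denotes the weights, $\alpha$ is a hyperparameter called the learning rate, $g$ is the gradient, $\delta$ is a very small value to avoid division by zero, and $r$ is the accumulated squared gradient (Goodfellow et al., 2016).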
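As an illustration of update (3.2), the following minimal sketch applies RMSprop to a single scalar weight. The decay rate `rho`, the default values, and the quadratic objective are assumptions chosen for the example and are not taken from the cited sources.

```python
import math


def rmsprop_step(theta, grad, r, alpha=0.01, rho=0.9, delta=1e-8):
    """One RMSprop update for a single scalar parameter.

    rho (decay rate of the moving average) and all default values
    are illustrative assumptions.
    """
    # Accumulate an exponentially decaying average of the squared gradient.
    r = rho * r + (1.0 - rho) * grad * grad
    # Step against the gradient, scaled by the root of the accumulated average.
    theta = theta - alpha / math.sqrt(delta + r) * grad
    return theta, r


# Hypothetical objective f(theta) = theta**2 with gradient 2 * theta.
theta, r = 5.0, 0.0
for _ in range(2000):
    theta, r = rmsprop_step(theta, 2.0 * theta, r)
```

Because the step is normalised by the accumulated gradient magnitude, the update size stays close to `alpha` regardless of how steep the objective is.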

Another successful optimisation algorithm is adaptive moment estimation, or Adam. It was introduced by Diederik Kingma and Jimmy Ba (2014) and combines features from both RMSprop and SGD with momentum, and adds bias correction. The parameters are updated according to formula (3.3)

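$$\theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \qquad (3.3)$$

where $\theta$ denotes the weights, $\alpha$ is the learning rate, $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first-moment estimate (from SGD momentum) and second raw moment estimate (from RMSprop), respectively, and $\epsilon$ is an infinitesimal value to avoid division by zero.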
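Update (3.3) can be sketched for a single scalar weight as follows. The decay rates `beta1` and `beta2` and the default values follow the defaults suggested by Kingma and Ba; the quadratic objective and the larger learning rate in the loop are assumptions for the example.

```python
import math


def adam_step(theta, grad, m, v, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t starts at 1)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for initialising m and v at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Hypothetical objective f(theta) = theta**2 with gradient 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, alpha=0.01)
```

In the first iteration the bias-corrected estimates reduce to the gradient and its square, so the first step has a magnitude of approximately `alpha`.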

The loss function and the optimisation algorithm work together during backpropagation, which is described below. The loss function is minimised during learning by updating the weights with the optimiser. The loss function measures the performance of the model by calculating the distance between the prediction and the true target. The shorter this distance, the smaller the loss and, therefore, the better the expected estimations. There are various types of loss functions, depending on the prediction type. For example, for a classification problem, the common ones are cross-entropy (log-likelihood) functions that measure the entropy with a probability as output. Cross-entropy loss functions are either binary or categorical (for more than two output classes). For a regression task, a typical loss function is the mean squared error (MSE). (Chollet, 2018)
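The two loss types mentioned above can be sketched in a few lines. The function names and the clipping constant `eps` are assumptions made for this illustration.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between labels (0 or 1) and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the probability so that log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Mean squared error for a regression target."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Confident correct predictions give a small loss, wrong ones a large loss.
good = binary_cross_entropy([1, 0], [0.9, 0.1])
bad = binary_cross_entropy([1, 0], [0.1, 0.9])
```

Both functions return a single number that shrinks as the predictions approach the targets, which is the property the optimiser relies on.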

Generally, loss functions can have numerous local minima to which the optimisation algorithm can converge, which can result in higher costs and inaccuracy. If a local minimum has a loss relatively close to the global one, this may not be a problem because sufficient accuracy can be achieved within the local minimum (Goodfellow et al., 2016).

Optimisation and loss functions are the backbone of a successful backpropagation process in neural networks. Backpropagation is the process that occurs after the output is computed. The process computes the derivatives of the previous iteration by using the loss function and the chain rule of calculus. These are then provided to the optimiser for reparameterisation. (Chollet, 2018)

A simple intuition of forward and backward propagation is shown in Figure 8, where one hidden unit with an activation function is used to compute the loss. The forward pass is marked with blue arrows, while the backward pass is marked with red. Here x represents the input, z is the output of the hidden unit, w is the weight, b is the bias term, g is the activation function, L is the loss function, ŷ is the computed output and y is the true target. The loss is then used to compute the gradients that are fed to the optimiser, as mentioned earlier. It should be noted that this process is usually conducted on mini-batches to increase the computational speed and the number of gradient updates. Therefore, a simplified example cannot fully cover the mathematical operations; however, the same principles apply regardless of batch and network size.
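The single-unit forward and backward pass can be written out as a minimal sketch. It assumes a sigmoid activation for g and binary cross-entropy for L; the input, target, initial parameters, and learning rate are hypothetical values chosen for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b):
    z = w * x + b         # output of the hidden unit
    y_hat = sigmoid(z)    # activation g(z) gives the computed output
    return z, y_hat


def loss(y_hat, y):
    # Binary cross-entropy for a single sample.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def backward(x, y_hat, y):
    # Chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw.
    dL_dyhat = -y / y_hat + (1 - y) / (1.0 - y_hat)
    dyhat_dz = y_hat * (1.0 - y_hat)
    dL_dz = dL_dyhat * dyhat_dz
    return dL_dz * x, dL_dz   # gradients with respect to w and b


x, y = 2.0, 1.0        # hypothetical input and true target
w, b = 0.5, -1.0       # hypothetical initial parameters
z, y_hat = forward(x, w, b)
loss_before = loss(y_hat, y)
dL_dw, dL_db = backward(x, y_hat, y)

# One plain gradient-descent update with an assumed learning rate.
learning_rate = 0.1
w, b = w - learning_rate * dL_dw, b - learning_rate * dL_db
loss_after = loss(forward(x, w, b)[1], y)
```

After the update, the loss on the same sample is lower than before, which is the behaviour the optimiser repeats over many mini-batches.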

3.4 Overfitting and underfitting

A nearly inevitable problem in ML is overfitting, and a fundamental skill for the practitioner is to be able to address it. Overfitting occurs when the model starts to recognise irrelevant patterns in the given data. The consequence is that the model may perform well on the training data but not on the validation or evaluation data that measure generalisation. This causes significant losses relative to the true targets, which deteriorates the accuracy. Generalisation is essentially the ability of the model to perform on unseen data.

There are various ways that overfitting can arise when training the model. One commonly known issue with classical ML models has been the bias-variance trade-off, which in modern ML models has for the most part been overcome. According to Belkin et al. (2018), in classical ML the model is bound to underfit if it has high bias. This signifies that the complexity of the function set is too low. In contrast to overfitting, underfitting is the model’s tendency not to hold enough functions or parameters to represent the data sufficiently. This leads to mediocre training performance. In classical ML, it was observed that when the function set complexity was much higher (an increased number of parameters), the model had high variance and was therefore bound to overfit. This was due to the model’s saturation to irrelevant patterns that satisfy the training data, so that the model may classify all training data points but generalises poorly to unseen data. Thus, it has been a tedious task for practitioners to balance bias and variance.

Figure 8 Forward and backward pass with one hidden unit and one activation function. Blue arrows show the forward pass, while red arrows indicate the backward pass where the gradients are computed.

Figure 9 Double descent risk curve. The left image demonstrates the classical case, where the task was to locate the sweet spot. The right side shows the interpolation threshold and the modern interpolating regime. (Belkin et al. 2018)

In modern ML methods, it has been observed that practitioners can fit the training set perfectly and still obtain excellent generalisation performance with highly complex function sets, where the overfitting regime has been surpassed by increasing the number of parameters. In the article by Belkin et al. (2018), this was shown with a double descent risk curve, as seen in Figure 9, where the initial curve demonstrated the bias-variance trade-off while the extended second curve showed the modern interpolating regime. A so-called interpolation threshold separated these two regimes. From the “sweet spot” of the bias-variance trade-off to the interpolation threshold, the model is bound to overfit as the function set complexity increases. When the threshold was surpassed, a further increase in function complexity resulted in higher accuracy until convergence.

Figure 10 Curve fitting. A) Overfitting, B) underfitting and C) a more desirable curve fit.

One of the best and most efficient techniques to avoid overfitting is to enrich the data set by supplying more samples. Moreover, techniques such as applying weight regularisation to the loss function can reduce overfitting. This forces the weights to take smaller values, making their distribution more regular, which can increase the versatility of the network. Another common technique is to add dropout, introduced by Hinton et al. (2012), which randomly deactivates output features of a layer during training, forcing higher weights on the remaining features. However, these features are not deactivated during validation or testing. (Chollet, 2018)
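Both regularisation techniques can be illustrated with a short sketch. The rescaling of the kept features ("inverted dropout"), the L2 form of the weight penalty, and the names `rate` and `lam` are assumptions made for this example.

```python
import random


def dropout(activations, rate, training, rng):
    """Randomly zero a fraction `rate` of the features during training.

    Kept features are scaled by 1 / (1 - rate) so that the expected sum
    is unchanged (inverted dropout, an assumption of this sketch).
    """
    if not training or rate == 0.0:
        # Dropout is inactive during validation and testing.
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def l2_penalty(weights, lam):
    """Weight regularisation term that is added to the loss."""
    return lam * sum(w * w for w in weights)


rng = random.Random(0)
features = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(features, rate=0.5, training=True, rng=rng)
test_out = dropout(features, rate=0.5, training=False, rng=rng)
```

The penalty grows with the squared weight values, so minimising the total loss pushes the weights towards smaller magnitudes.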

A simplified example of a classification problem is demonstrated in Figure 10, where curves are fitted to enable the classification of crosses and circles. A) represents overfitting, where the model classifies all the training data correctly with a highly complex curve. This model may perform poorly when new points are predicted. B) represents an underfitted curve, where the model is too simple to estimate new data points. C) demonstrates better generalisation performance than A) by allowing some misclassification.

3.5 Distributing the data

ML models must be tested after training to measure their true performance. As discussed in the previous section, modern ML models still suffer from overfitting, which must be identified. This is accomplished by setting aside a separate testing or evaluation set consisting of samples that differ from the training set but are relevant to the task. After training, the model predicts these samples from the evaluation set without adjusting any parameters, displaying its true generalisation performance. Furthermore, another smaller data set, called the validation set, is also extracted from the complete data set. This set also consists of samples relevant to the model. The validation set is left the smallest of the three since it is used during training to assess the model’s generalisation performance. Thanks to this, actions such as saving the best-performing weights during training or stopping the training early can be performed if the model performance deteriorates. Each set must be kept separate at all times, as seen in Figure 11. If there were information leakage from the validation set to the training set, the generalisation performance would be deceptive. The same applies from the training set to the testing set. Additionally, the test set must differ from the validation set, despite both being used to measure generalisation. This is because information can leak from the validation samples via hyperparameter tuning based on the validation measurements, as seen in Figure 11. (Chollet, 2018)

Figure 11 Data set is divided into three different sets. Crosses indicate that samples are not shared between sets (no leakage). However, information leakage from the validation set to the training set occurs during parameter tuning.
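The three-way split described above can be sketched as follows. The split fractions, the seed, and the function name are assumptions for the example and do not reflect the proportions used in this work.

```python
import random


def split_indices(n_samples, val_fraction=0.1, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into three disjoint sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * val_fraction)
    n_test = round(n_samples * test_fraction)
    # Slicing one shuffled list guarantees that no sample is shared.
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


train, val, test = split_indices(100)
```

Splitting by index keeps the samples themselves untouched, so the same partition can be reused for the images and their labels. Note that this guards only against shared samples, not against the leakage through hyperparameter tuning mentioned above.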

3.6 Data augmentation

One of the main problems in ML is the scarcity of the data needed to develop generalised representations. With too few samples, the model is unable to extract relevant information. Therefore, when the data set is enriched, the model may capture the feature information and associate it with new data.

ML is challenging to implement in the field of NDT because of the scarcity of data, or of flawed data (Virkkunen, Koskinen, Papula, Sarikka, & Hänninen, 2019). Obtaining or manufacturing such actual flawed data can be difficult or expensive.

Data augmentation is a tool to mitigate the problem of small data sets. New data can be created by introducing random transformations on representative samples to increase the data set size. This may provide the model with more versatility, which can prevent it from saturating by memorising the training data. (Chollet, 2018) However, traditional data augmentation should not be applied incautiously: the relevance of each transformation should be considered, e.g. whether rotated or sheared signals are realistic. Additionally, it should be realised that data augmentation does not compensate for the versatility of truly unique data, such as various natural flaws that all yield unique signals.

Virkkunen et al. (2019) generated representative flaws extracted from scanned flawed samples. The virtual flaw could be modified in depth or length, moved, and then implanted into different samples, enabling nearly unlimited representative samples for training and evaluation of the model.

Data augmentation can also reduce the skew of the imbalanced distribution of positive and negative samples that is typical of many applications. The consequence of naively training a model with a heavily skewed (imbalanced) data set is that the majority class dominates the reparameterisation. This is because the loss function is presented more often with majority samples, which therefore have a more significant impact on the gradient updates (statistics skewed toward the majority class). Thus, by augmenting new minority class samples, the data set can be balanced, which mitigates this problem in the learning of the model.

3.7 Batch normalisation

Significant variability in the input values can cause substantial weight updates during backpropagation. This can result in learning instability. Therefore, each input must be scaled or normalised so that a smaller variance is obtained between the input values. Furthermore, this saves computational cost since processing smaller values in a network is much cheaper and faster.

In more complex networks, large-valued outputs from the hidden layers can occasionally be observed, causing similar instability as discussed above. For example, a convolution and non-linear activation operation can inflate the values of an input or feature map, leading to significant variance. Thus, it is useful to normalise the data during the forward pass by applying batch normalisation (batch norm) layers that re-centre and re-scale the input to zero mean and unit variance. Also, as recalled, the training is conducted in mini-batches, so global normalisation cannot be done and this operation is performed batch-wise.
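The batch-wise re-centring and re-scaling can be sketched for one feature across a mini-batch. The learnable scale `gamma` and shift `beta` are part of the method of Ioffe and Szegedy but are held at fixed defaults here; the stabilising constant `eps` and the sample values are assumptions for the example.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature over a mini-batch (training-time statistics)."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # Re-centre to zero mean and re-scale to unit variance, then apply
    # the scale (gamma) and shift (beta), which would normally be learned.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Hypothetical large-valued activations from a hidden layer.
normalised = batch_norm([10.0, 20.0, 30.0, 40.0])
out_mean = sum(normalised) / len(normalised)
out_var = sum((x - out_mean) ** 2 for x in normalised) / len(normalised)
```

The statistics are computed from the current mini-batch only, which is why the operation works without access to the complete data set.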

There has been a great deal of discussion about the effectiveness and positioning of batch norm layers in the network. Batch norm, proposed by Sergey Ioffe and Christian Szegedy (2015), was intended to address the internal covariate shift of the network by rescaling the inputs within the hidden layers to ensure that learning is stabilised. Specifically, this forces the layer inputs to maintain approximately the same mean and variance despite the changes from the gradient updates. By definition, the covariate shift is the change in the distribution of the independent variables. So, as the parameters of the previous layer change, the input distribution of the next layer changes, thus changing the activation distributions. Hence, lower learning rates were previously required to achieve convergence, which in turn slows down the training of the model. Additionally, batch norm was observed to regularise the network, as the dropout layer does.

Ioffe and Szegedy (2015) also proposed that batch norm in convolutional networks should be applied before the non-linear activation function. They argued that normalising the output of the activation would not reduce the covariate shift, since the shape of this distribution would probably change during the training of the model. Instead, the distribution before the activation is more likely to be non-sparse and symmetric, so that when normalised, the activations can generate more stable distributions.

Dmytro Mishkin et al. (2017) conducted empirical research on the order of batch norm relative to the non-linear activation, where batch norm placed after the activation showed higher accuracy in testing. Additionally, Christian Garbin et al. (2020) compared the performance of dropout and batch norm on a DCNN, with batch norm placed after the activation, and showed that adding the layer had no side effects and that it should be the first step to optimise the DCNN before dropout.

In some cases, dropout added to networks with batch norm can be considered unnecessary or damaging. For example, if dropout were performed prior to the batch norm, the distribution during testing would not correspond to the distribution during training. This is because dropout is only active during training. Moreover, if the order were changed, the dropout would harm the distribution from the batch norm.

3.8 Densely connected neural network

The densely connected neural network utilises layers that connect all inputs to each unit of the layer, performing matrix multiplication for each combination. This means that the network attempts to map similarities based on the input vector data. Consequently, when all inputs are connected to each unit, the number of parameters in the network increases tremendously if, e.g., the input is a large image. Despite this, the dense network is fast and straightforward to implement for tasks that it can handle. Generally, dense networks perform well on tasks where the given data is vector data (two-dimensional, including the sample and the feature axes). For example, a densely connected network can achieve relatively high accuracy on the MNIST data set, which is a classical database of handwritten digits. Processing this data set with networks is often considered the first step into the field of machine learning. (Chollet, 2018)
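The growth in parameters with input size can be checked with simple arithmetic. In this sketch the layer width of 512 units and the 512 × 512 image size are hypothetical values; the 28 × 28 size is that of an MNIST digit.

```python
def dense_param_count(n_inputs, n_units):
    """Learnable parameters of one dense layer: weights plus biases."""
    return n_inputs * n_units + n_units


# A flattened 28 x 28 MNIST digit fed to a hypothetical 512-unit layer.
mnist_params = dense_param_count(28 * 28, 512)

# A flattened 512 x 512 image fed to the same hypothetical layer.
large_image_params = dense_param_count(512 * 512, 512)
```

The first layer needs 401,920 parameters for the MNIST digit but 134,218,240 for the larger image, which illustrates why dense layers scale poorly to large images.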

Dense layers are also often found as the last layers of other types of networks, such as deep convolutional networks. These layers provide the classification probabilities so that the loss can be measured. Therefore, the high-dimensional outputs of the last convolutions must be flattened to a vector so that the dense layers and the classifier can process them. (Chollet, 2018)

Dense in NDT

Previous studies have shown that neural networks can classify flaws in materials with relatively high accuracy. For example, Cruz et al. (2016) examined a multilayer perceptron (MLP) neural network, a network with dense layers, in combination with feature extraction techniques to identify flaw/no flaw signals from steel welded joints, after which the flaw type was identified.

In addition to the study of the MLP network with inputs from feature extraction, Munir et al. (2018) attempted to train densely connected neural networks to identify flaw types without using feature extraction. Here, the performance of a shallow network containing a single dense layer with 502 units was compared to that of a densely connected network with two hidden layers (500 and 50 units, respectively) and three dropout layers (dropout probability of 0.5). Measurements were conducted on a database from earlier studies containing artificially induced cracks, lack of fusion, lack of penetration, porosity, and slag inclusions in steel weldments (Song & Kim, 2000). The best performance was achieved with the network for the lack of fusion defect, with 96% correctly classified. The regularising effect of dropout was also observed with the single layer, providing a minor performance boost in some classes. These studies demonstrated the potential performance of just a couple of dense layers with dropout, further signifying the potential of neural networks in the field of NDT.

3.9 Deep convolutional neural network

Deep convolutional neural networks (DCNN) have proven themselves powerful in image classification and object detection tasks. A DCNN usually has a more complex architecture than the sequential matrix multiplication operations of fully connected layers. A DCNN has convolutional layers that can learn local patterns or features from the input by sharing weights. This enables the same features to be recognised in different locations of the input.

Compared to dense networks, a DCNN holds sparse connections between the input and output units of its layers. This is achieved by convolving the input with a filter or kernel, which is a smaller tensor than the input. Convolving the input image with the kernel shows where the concerned feature occurs in the input. Subsequently, the kernel with the same weights is convolved through the next input (in the same batch), showing the same feature on that image in a so-called “feature map”. This enables parameter sharing and sparse connectivity, which reduce the memory requirements and increase efficiency by reducing the number of parameters. (Goodfellow et al., 2016)

Convolutional layer

Convolutional layers are implemented in the model as one-, two- or three-dimensional. The number of dimensions indicates along how many axes the convolution is performed. This implies that a one-dimensional convolution moves spatially in one direction, while a two-dimensional one iterates through two dimensions, and so forth. These convolutional layers all work similarly by applying the same convolution operation to each region of the input. Therefore, in principle, this operation may recognise the current feature anywhere in the input. This is visualised in Figure 12 with a conservative kernel convolved through an MNIST handwritten digit. The feature map demonstrates the relevance of the feature to the input. During learning, the weights and biases of these kernels are reparametrised to obtain feature detectors that map the input more efficiently. (Chollet, 2018)

The discretised convolution operation for a two-dimensional image can be represented with

$$S(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n) \qquad (3.4)$$

where the result S is a feature map, I is the input image and K is the kernel. The indices i and j denote the assessed location in the feature map and input image, while m and n denote the elements that are multiplied in both the kernel and the region of the input. This operation produces the feature map for the current kernel by summing each product of the image and kernel over the window that is currently being processed. (Goodfellow et al., 2016)
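The windowed sum of products in (3.4) can be sketched directly with nested loops. The sketch assumes no padding and a stride of one, and the small image and the 3 × 3 kernel are arbitrary values chosen for the example.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products.

    Assumes no padding and a stride of one, so the feature map is
    smaller than the input.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Sum over m and n of I(i + m, j + n) * K(m, n).
            row.append(sum(image[i + m][j + n] * kernel[m][n]
                           for m in range(k_h) for n in range(k_w)))
        feature_map.append(row)
    return feature_map


image = [[1, 0, 0, 1],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [1, 0, 0, 1]]
kernel = [[0, 1, 1],
          [0, 0, 1],
          [1, 0, 1]]
feature_map = conv2d(image, kernel)
```

The same nine kernel weights are reused at every window position, which is the parameter sharing that keeps convolutional layers small compared to dense layers.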

Convolutional network

Convolutional neural networks are typically built by stacking sequential convolution layers. The outcome of this is that deeper layers may detect higher-level features. This is because the inputs of these layers are spatially reduced feature maps. In summary, at a low level, the network detects edges, lines, and corners, while at a higher level, the layers detect more global features such as eyes and a nose in a face recognition task. This does not imply that a high-level feature map would resemble a nose, but that if that feature were present, higher activations could be expected on some feature maps. (Chollet, 2018)

Figure 12 MNIST handwritten digit convolved with a single kernel. The MNIST digit is on the left and is convolved with the kernel shown in the middle. The resulting feature map is shown on the right. Kernel values shown in the figure:

0 1 1
0 0 1
1 0 1

Because a DCNN often comprises many convolutional operations, the number of learnable parameters can rapidly increase with the number of feature maps. Therefore, pooling layers are beneficial to reduce the spatial dimensions of each feature map while maintaining the number of feature maps or channels, so that pixel windows can be represented with a single pixel. In image classification or object detection, the max-pooling operation is a good choice. It applies a predefined window with a hardcoded maximum operator, extracting only the max value from the current window and outputting one value per window shift. (Chollet, 2018) For example, if a single feature map of arbitrary size were pooled with a window of 2 × 2, the width and height of the resulting feature map would be halved from the original size.
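The max-pooling operation can be sketched in the same style. The sketch assumes that the window is shifted by its own size, so that the windows do not overlap; the feature map values are arbitrary.

```python
def max_pool2d(feature_map, size=2):
    """Keep only the maximum of each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[i + m][j + n]
                           for m in range(size) for n in range(size)))
        pooled.append(row)
    return pooled


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]
pooled = max_pool2d(feature_map)
```

The 4 × 4 input is reduced to a 2 × 2 output, so each spatial dimension is halved while the strongest activation of every window is preserved.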

Another advantage of the pooling layer is that it can provide the network with translation invariance. For example, a translated object in an image generates different activations from the original image during convolution but can give the same response after pooling. This is useful in tasks where only the presence of a feature is of interest and not particularly its location. The drawback of pooling is the lost spatial information. (Goodfellow et al., 2016) Therefore, more care should be taken with pooling operations for object detection.

Finally, at the end of a DCNN, the high-dimensional feature maps can be flattened into a vector whose length is the product of the dimensions of their shape. This enables the use of dense layers with units representing the probabilities for classification. Some more advanced object detection networks can output higher-dimensional tensors consisting of the object class and spatial attributes.

3.10 Image classification and object detection

Convolutional neural networks have experienced great success in image classification and object detection. The more basic of these two variants is the image classification network, developed to predict the class or classes of objects without directly providing spatial information. Such a classifier can be turned into an object detector if the input image is cropped into smaller segments and the model estimates each segment. If successful, the model could classify the segments containing the object and thereby provide the spatial information. However, this method is inefficient due to the increased number of inputs. Therefore, dedicated object detection models could be a better option.

Simonyan and Zisserman (2014) introduced very deep convolutional networks called VGG16 that obtained first and second place in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC2014) in the localisation and classification tasks, respectively. Their two proposed network configurations comprised 16 to 19 weight layers. These networks succeeded thanks to the availability of large-scale data sets and developments in high-performance computing that have enabled deep neural networks. Initially, in the paper, the model was constructed to provide class scores using dense layers but was reconfigured to output 4-dimensional tensors for localisation tasks to predict object bounding boxes.

The architecture of VGG16 is shown in Figure 13. Initially, the image is passed to two sequential convolutions, where more global features can be identified. This is followed by pooling the feature maps with a maximum operator. The network then follows the same pattern, with an added convolution after the second block. Following the convolutions, the feature maps are vectorised and passed to a dense network comprising three dense layers and the final classification nodes. Many neural networks use this network or the same pattern as VGG16 due to its simplicity and great success in classification tasks.

Recently, object detection networks have shown state-of-the-art performance while processing image frames in real time. The region-based convolutional neural network, or R-CNN, utilises region proposals from input images. In newer configurations, these are generated by a separate region proposal network branch instead of the selective search algorithms used in the earlier configurations of R-CNN. These region proposals are then pooled through a Region of Interest (RoI) layer that constructs equal-sized feature maps and feeds them to the main branch for final classification. (Girshick, Donahue, Darrell, & Malik, 2014) During test time, one of the newest configurations of R-CNN, called Faster R-CNN, processed images at 5 frames per second on a GPU (Shaoqing, Kaiming, Girshick, & Jian, 2017).

You Only Look Once, or YOLO, proposed by Redmon et al. (2016), is another object detection network that has shown high classification and localisation performance, with a considerably higher image processing speed than Faster R-CNN. In contrast to VGG16 and Fast R-CNN, YOLO outputs a tensor with a shape corresponding to the number of classes and bounding box coordinates. Additionally, the architecture is unified so that no separate branch is required to estimate the region proposals. The newer configuration, YOLOv3, is well suited for real-time object detection due to its several times higher inference speed on a GPU compared to Fast R-CNN (Redmon & Farhadi, 2018).

The problem with these networks is that they are pretrained for specific tasks. For example, VGG16 is trained with and for the ImageNet data set, meaning that the weights are established according to the samples of this data set. Therefore, the network should be fully or partially retrained for new types of input due to its unique feature detectors from pretraining. Moreover, the models are overly complicated for simpler tasks, with too many layers, since they are designed for complicated classification and object detection tasks. For example, YOLOv3 comprises 50 million parameters. In the scope of more straightforward UT images and features, a shallower DCNN can perform equally well while reducing the training and inference time, which can also facilitate implementing the model in practice.

Figure 13 Architecture of VGG16. The multiplier inside the convolution blocks defines how many convolutions are performed prior to pooling.

3.11 Image classification in NDT

The AlexNet DCNN has been utilised for crack detection in images of concrete structures by Dorafshan et al. (2018), where its performance was compared to conventional edge detectors. The performance was further compared by partially training the network with transfer learning and classifier learning, both of which trim down the number of learnable parameters by utilising pretrained weights obtained by training AlexNet with the ImageNet data set. The highest performance was observed with transfer learning, by freezing some of the convolutional layers, with 86% of cracks detected, while the prediction of a crack not being present was 99% correct. This research showed superior performance of the DCNN compared to traditional edge detector schemes such as the Prewitt and Gaussian schemes, once more showing the opportunity of neural networks in the field of NDT.

Virkkunen et al. (2019) developed a modern DCNN to detect flaws in ultrasonic data from austenitic stainless steel welds and compared its performance to human inspectors. The data was acquired with phased-array ultrasonic equipment from a butt-welded 316L pipe with thermal fatigue-induced cracks. Subsequently, the data was augmented to increase the data set, as discussed in Section 3.6. The DCNN utilised resembled a VGG16 network, with sequential convolution and pooling layers followed by dense classifier layers. The test results demonstrated a perfect fit of the network to the training set by showing minimal validation loss and high validation accuracy. The evaluation was compared to human performance with hit/miss analysis. The model was able to predict each sample correctly with a minimum flaw size of 0.9 mm and with zero false calls. The minimum flaw size found by a human inspector was 2.7 mm, with a much higher number of false calls. This showed superhuman performance on the concerned NDT task, where incredibly noisy data was assessed. A limitation of this research was the limited number of flaws, which does not capture the variability of natural flaws.

4 Assessing phased array data with machine learning

Reliability assessment in NDT is often challenging because large data sets are required to ensure a sufficient level of confidence. Therefore, these data sets may require many samples from ultrasonic experiments, increasing the expenses for material and labour. (Subair et al., 2015) Despite the high costs, such quantitative reliability systems are essential in the scope of various maintenance plans in nuclear power plants (Kurz et al., 2013).

Advancements in ML have shown potential in various NDT tasks. One of the main factors that have enabled ML in practical applications is the improvement of data acquisition methods that generate large-scale data sets. Without a sufficient number of varied data points, the network is unable to identify typical flaw patterns. Therefore, state-of-the-art ML models can benefit from phased array systems to facilitate and improve inspection quality.

ML is considered a way to address the difficulties that govern modern ultrasonic inspections. Austenitic stainless steel welds typically possess very complex acoustical properties that complicate wave propagation. This results in a demanding environment for the human inspector, in which the inspection is sensitive to errors caused by the inspector or the equipment. Therefore, inspector qualifications are necessary. Moreover, the quality of inspection can suffer from the increased data sizes that burden the inspector. ML could mitigate the burden of the repetitive task by taking advantage of the large-scale data sets that modern PAUT systems provide, and maintain homogeneous quality while processing the data. ML has also been shown to reduce the need for feature engineering by learning the features from the data.

It has been observed that human inspectors are very good at localising and identifying flaws in highly complex signals but have missed some critical flaw sizes during qualifications. An ML model can be trained with various flaw sizes and tested with plenty of data so that human errors can be prevented. Large-scale data sets may enable models to learn dynamical properties or patterns of the defects so that they can be separated from the noisy background. Therefore, ML models may surpass human inspectors by identifying smaller flaws based on their characteristics. The remaining responsibility of the human inspector is to maintain homogeneous scanning quality, so that the model is provided with rich data, and to make the final judgement.

Besides the large sets of representative data needed for the qualification of human inspectors, even larger data sets are required for ML models. The reason is that a model's generalisation capability stems from the quality of the training data. Moreover, an ML model requires true quantitative evidence from the application concerned to provide valuable classification performance. To mitigate problems that stem from small data sets, methods such as data augmentation are typically used. Data augmentation increases the amount of representative data by applying transformations to the original data without compromising training performance.

ML could be implemented in UT in different ways. A naïve approach would be for both the human and the ML model to inspect the data. The final judgement is made by the human, who either uses the ML results or not. This method does not reduce the workload of the inspector. The other extreme is to let the model process the data without human observation and judgement. This method has potential but is not sensible before future qualification. Currently, a good approach would be to feed the model with the data and use the results to establish the expert and final judgement, which would be made by the human inspector (Virkkunen, Bolander, Myöhänen, & Miorelli, 2019).

As a result, the use of ML in the field of NDT has become more attractive. Therefore, this research studies the implementation and feasibility of a state-of-the-art deep convolutional neural network in the ultrasonic inspection of 316L austenitic stainless steel welds. The results are based on an evaluation of the model that focuses on reliability measures for unseen ultrasonic B-scans.

4.1 Data acquisition and augmentation

The inspection target of this study was 316L austenitic stainless steel weldments, which are typical throughout primary circuits and emergency feedwater piping in nuclear power plants (NPPs). Typical flaws in these structures, including manufacturing flaws, are lack of fusion, lack of penetration, porosity, and root cracks. The inspection samples were three flaw-free austenitic welds that serve as the empty canvases for flaw embedding. The thickness and details of the welded samples and of the homogeneous samples with flaws are presented in Table 1. The first three rows are the weldments (two 30 mm and one 20 mm thick welds). The thicker samples were welded with mechanised shielded metal arc welding (SMAW) and gas tungsten arc welding (GTAW). The thinner 20 mm weld was welded with GTAW.

The flaw data was extracted from 11 homogeneous austenitic stainless steel plates of the same grade, with a thickness of 20 mm, supplied by Trueflaw. These samples included 16 true thermal fatigue cracks. The flaw manufacturing technique ensures controlled crack growth. The flaws were later embedded into the welded samples using the virtual flaw to obtain representative weld samples. The virtual flaw has been applied successfully to the training and qualification of human inspectors in probability of detection (POD) estimations. The dimensions of each flaw are listed in Table 2. The method of flaw implantation is described below.

Table 1 Inspection samples. Weldments are the empty canvases for flaw embedding. Base material is the homogeneous plating containing thermal fatigue cracks. (Siljama et al., 2020)


Setup

All test objects were scanned with a dual matrix array probe with a 7 × 4 array layout, where 28 elements were used for both transmitting and receiving, placed on a Rexolite wedge. The contact medium was water from a spray canister. The active aperture of the probe was 19 × 12 mm, and its frequency was 2.25 MHz. This setup constituted a transmit-receive longitudinal (TRL) inspection, in which longitudinal waves are emitted into the test samples. According to Singh (1983), longitudinal waves are preferred for their smaller velocity variance in anisotropic material, which results in less beam skewing, at the cost of a lower SNR.

The focal laws were set to range from 40° to 70° in 1° steps so that 55° was centred at the weld root. The one-degree step determines the angular resolution and provides elementary A-scan data for each angular step, which is recorded for use in post-processing. These steps are therefore referred to as separate channels (31 in total), each containing different values but from the same location on the scan line. The scan resolution was 1 mm along the scan line of 471 mm. Probe movement was monitored with a single encoder, and deviation was prevented with a stationary rail. The recorded sound path window ranged from 3.46 μs to 27.75 μs with a temporal resolution of 0.01 μs. This setup yielded a 2429 × 31 matrix for each step on the scan line.
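The stated matrix size can be checked directly from the focal-law range and the recorded time window. The following is a minimal arithmetic sketch; the variable names are illustrative and not taken from the acquisition software.

```python
# Focal-law angles: 40 to 70 degrees in 1 degree steps, one channel per angle.
angles = list(range(40, 71))
n_channels = len(angles)

# Recorded sound path window (microseconds) and temporal resolution.
t_start_us, t_end_us, dt_us = 3.46, 27.75, 0.01
n_samples = round((t_end_us - t_start_us) / dt_us)

# Index of the 55 degree channel that is centred at the weld root.
root_channel = angles.index(55)

print(n_samples, n_channels, root_channel)
```

With these numbers, each scan-line position holds 2429 time samples for each of the 31 channels, and the 55° channel is the middle one.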

Sensitivity calibration of the test equipment was conducted on an FSH S20 stainless steel test block containing a machined 1 mm notch. The gain was adjusted to reach 80% of the maximum amplitude at 55° on the notch.

Data collection was carried out by a certified level 3 inspector using a Zetec TOPAZ64. The ultrasonic data was encoded as signed 16-bit integers (int16) and then recorded.

Samples for training

To obtain a sufficient and balanced set of representative data, data augmentation was performed. Flaws were extracted from the base material samples and embedded into flaw-free weldments using the virtual flaw. During flaw embedding, the method also applies random transformations that introduce more variability, which enables the production of a nearly unlimited number of representative samples.

Table 2 True dimensions of each thermal fatigue crack. Numbering corresponds to the notes of Table 1. (Siljama, Koskinen, Jessen-Juhler, & Virkkunen, 2020)

Each flaw in the base material samples was scanned from 5 different locations (+10, +5, 0, -5, -10 offset), providing 80 raw flaw signals in total. These signals lack the characteristics of the weldment and therefore could not be utilised as canvases. The flaw-free welds were scanned from both sides, producing six different raw canvases for augmentation. One side of the thinner weld was scanned with a different setup and therefore could not be utilised.

Every channel of each sample was clipped to 48 A-scans and 1020 pixels around the region of interest. The size of each sample was therefore 48 × 1020 × 31, approximately 1.52M values in total. The A-scans were shifted by applying a random-walk offset, which simulates probe jitter by randomly shifting the A-scans in the temporal dimension. The samples were generated with the following procedure:

(1) A flaw-free sample was taken from the empty weld canvas and clipped to the area of interest. For each empty canvas, a set of 500 000 samples was generated in 50 batches (10 000 samples per batch).

(2) With a uniform sampling of flawed and flaw-free samples:

    (a) If flaw:
        (i) A random flaw was picked and embedded in a random location.
        (ii) Amplitude was decreased with a random factor ranging from 0.4 to 1 to simulate reduced flaw size.
        (iii) The flaw A-scans were then shifted with a random-walk offset.
        (iv) The area of interest 48×1020 was clipped so that the entire flaw was within the window.
        (v) A secondary random-walk offset was applied to the A-scans.

    (b) If flaw-free:
        (i) The area of interest 48×1020 was clipped.
        (ii) A random-walk offset was applied to the A-scans.
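The procedure above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the virtual flaw implementation: the flaw is added to the canvas by simple superposition, the random walk uses steps of at most one time sample and wraps around at the edges, and steps (ii) and (iii) are applied to the flaw patch before it is placed. The function names `random_walk_shift` and `generate_sample` are hypothetical.

```python
import numpy as np

def random_walk_shift(bscan, rng, max_step=1):
    """Shift each A-scan (first axis) in time by a cumulative random-walk offset.

    np.roll wraps samples around the edges, which is a simplification.
    """
    steps = rng.integers(-max_step, max_step + 1, size=bscan.shape[0])
    offsets = np.cumsum(steps)
    shifted = np.empty_like(bscan)
    for i, off in enumerate(offsets):
        shifted[i] = np.roll(bscan[i], off, axis=0)
    return shifted

def generate_sample(canvas, flaws, rng, window=(48, 1020)):
    """Return (sample, label); arrays are (A-scans, time samples, channels).

    Assumes every flaw patch is smaller than the window.
    """
    n_a, n_t = window
    data = canvas.astype(np.float32)
    # Random window position inside the canvas.
    a0 = rng.integers(0, canvas.shape[0] - n_a + 1)
    t0 = rng.integers(0, canvas.shape[1] - n_t + 1)
    has_flaw = rng.random() < 0.5  # uniform sampling of flawed / flaw-free
    if has_flaw:
        flaw = flaws[rng.integers(len(flaws))].astype(np.float32)
        flaw = flaw * rng.uniform(0.4, 1.0)   # (ii) random amplitude reduction
        flaw = random_walk_shift(flaw, rng)   # (iii) random-walk offset on the flaw
        fa, ft = flaw.shape[:2]
        # (i), (iv) random location chosen so the whole flaw stays in the window
        pa = a0 + rng.integers(0, n_a - fa + 1)
        pt = t0 + rng.integers(0, n_t - ft + 1)
        data[pa:pa + fa, pt:pt + ft] += flaw  # assumption: additive embedding
    sample = data[a0:a0 + n_a, t0:t0 + n_t]   # clip the area of interest
    sample = random_walk_shift(sample, rng)   # (v) / (b)(ii) second offset
    return sample, int(has_flaw)
```

Calling `generate_sample` repeatedly with `rng = np.random.default_rng()` would yield a stream of roughly balanced flawed and flaw-free samples from one canvas.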

The augmented data was stored in compressed binary files that were approximately 500 GB in total. Part of the preprocessing, described below, was also applied during data augmentation for efficiency and fast data transfer (reduced data size). The raw augmented data was also stored to enable validation of the preprocessing. In addition, separate label files were stored, containing the binary flaw label, flaw type, augmentation factor and bounding box coordinates for each sample. When the flaw was absent, all the corresponding label data were zeros.

4.2 Preprocessing the data

The ultrasonic data acquired with a dual matrix array usually contains much redundant information that burdens model learning. For this reason, it is sensible to preprocess the raw data to facilitate learning by reducing the size of the model input. Preprocessing requires an understanding of the information content so that no crucial information is lost. By applying effective preprocessing pipelines, computational resources can be utilised more efficiently, and the preprocessing can be implemented together with the model without compromising learning or inference.


To sample ultrasonic data from austenitic stainless steel welds, plenty of A-scans are required to capture the waveform. The ultrasonic waves in the phased array system are significantly frequency filtered because of the resonance of the probe. Due to this filtering, the data contains less critical information about the flaw and should therefore be reduced. Moreover, the human inspector does not usually take advantage of the frequency spectrum; the raw data is often presented rectified, with the channels merged. Therefore, a similar approach is applied to the data for training the model.

After the previously described data augmentation step, the data was rectified, which enabled pooling with a maximum operator whose window size corresponds to 1/2 wavelength, as done by Virkkunen et al. (2019). The wavelength pooling ensured that even minimal flaw information was not lost while reducing the size of a single sample to 48 × 34 × 31 = 50 592 values. The preprocessing steps described above, together with flaw embedding, are visualised in Figure 14. After pooling each channel, the data was stored in compressed binary files. Ultimately, this was the data fed to the input pipeline for on-the-fly preprocessing integrated into the model training.
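Rectification followed by max-pooling along the time axis can be sketched as below. Two points are assumptions rather than details from the text: rectification is taken to be full-wave (absolute value), and the pooling window of 30 time samples is inferred from the reduction of 1020 samples to 34. The function name `rectify_and_pool` is illustrative.

```python
import numpy as np

def rectify_and_pool(sample, window=30):
    """Rectify a sample and max-pool it along the time axis.

    sample: integer array of shape (A-scans, time samples, channels),
            for example (48, 1020, 31).
    window: pooling length in time samples (assumed: 1020 / 34 = 30).
    """
    n_a, n_t, n_c = sample.shape
    if n_t % window != 0:
        raise ValueError("time axis must be a multiple of the pooling window")
    # Cast to int32 first so that abs(-32768) does not overflow int16.
    rectified = np.abs(sample.astype(np.int32))
    # Group the time axis into blocks of `window` samples and keep the maximum.
    pooled = rectified.reshape(n_a, n_t // window, window, n_c).max(axis=2)
    return pooled
```

For a 48 × 1020 × 31 input this returns a 48 × 34 × 31 array, that is, 50 592 values per sample.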

After the secondary storing, predetermined training and testing sets were selected from the compressed binary files. The training set consisted of samples from four welds, while the last was left for testing the model. In addition, a batch of 10 000 samples from one of the welds used for training was stored for validation, to measure generalisation performance during training and to store the best-performing model. The validation samples came from the same welds as the training samples because of the low number of weld canvases available. Samples containing flaw depths up to 3.3 mm, see Table 2, were observed to be unidentifiable if embedded in a location with high weld noise. Therefore, these were dropped from the training and validation data sets before the second binary compression to avoid an excessive number of false calls, as described below. For comparison, a model was also trained with all flaw sizes. Moreover, each channel of the samples was considered separately and stored as a 48 × 34 array with separate labels.
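The selection logic, holding out one weld canvas for testing and dropping flaws of up to 3.3 mm from the training data, can be illustrated as follows. The record fields `canvas`, `flaw` and `depth_mm` are hypothetical stand-ins for the label files, and the function name is illustrative.

```python
def split_by_canvas(samples, test_canvas, min_depth_mm=3.3):
    """Split sample records into training and testing lists.

    Testing keeps every sample from the held-out canvas, with all flaw sizes.
    Training keeps flaw-free samples and flaws deeper than min_depth_mm.
    """
    train, test = [], []
    for s in samples:
        if s["canvas"] == test_canvas:
            test.append(s)
        elif not s["flaw"] or s["depth_mm"] > min_depth_mm:
            train.append(s)
    return train, test
```

Splitting by canvas rather than by sample keeps the test weld's noise pattern unseen during training, which is what makes the test an out-of-sample one.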

Figure 14 Preprocessing of ultrasonic data. First, the multichannel flaw data is embedded into the weldment and then stored. Subsequently, each channel is rectified and pooled.


Two other preprocessing approaches were also tried in addition to the one described above. In the first approach, the channels were merged by calculating their average, which reduced the size of the samples from 48 × 34 × 31 to 48 × 34 × 1. It was observed that numerous samples retained the flaw information; however, individual larger flaws faded as a result of the merging process. In the second approach, all channels were fed to the network simultaneously, providing maximal information to the network.

Figure 15 The effect of flaw sizes in training the ML model. Two scenarios are presented. a) Model A shows predictions when all flaw sizes are included. This results in uncertainty in both flaw-free samples and small-flaw samples with low SNR, since the model associates small flaws with noise. b) In the second scenario, Model B is trained by dropping flaws below a threshold, which prevents it from associating smaller flaws with noise. With Model B, the flaw-free sample is correctly classified with high confidence but the smaller flaw is missed.

The primary purpose of omitting small flaws during training is to prevent the model from associating noise with undetectable flaws. If smaller flaws with low SNR were included in the training and validation sets, the number of false calls would increase rapidly in testing. The trade-off is that training exclusively with larger flaws would result in significant class imbalance, skewed towards the flaw-free samples, and a poor POD curve. Class imbalance in this task is mitigated by introducing more minority class samples to the training set with data augmentation. A poor POD curve denotes that the model can have excellent training performance but predicts smaller, still visible flaws too poorly to attain human-level performance. Omitting the small flaws allows the model to learn flaw properties that may be separable from the highly noisy background, instead of flagging false calls on every amplitude peak.

Figure 15 presents an example of training with and without the small flaws. Training without dropping them makes the model associate flaws with noise, resulting in an excessive number of misclassifications and uncertainty in training and testing. When the model is trained with the larger flaws only, the uncertainty of misclassifying flaw-free samples is reduced, but the probability of misclassifying small flaws with low SNR during testing increases. The second scenario should be considered superior due to its higher classification confidence. Most importantly, misclassifying flaw-free samples is more critical than misclassifying nearly undetectable flaws.

Ultimately, the data was fed to the model's input pipeline, which included on-the-fly preprocessing. Here, the single-channel samples were converted from signed 16-bit integers to 32-bit floating-point numbers and divided by 16384 to scale the data to the range 0 to 2, with the majority of values between 0 and 1. This was done in place of zero-mean and unit-variance scaling since ultrasonic data does not follow the same principles as typical digital images. Instead, it is desirable to maintain the rectified form. Moreover, since the data is divided by a constant value, the high and low amplitudes of large and small flaws, respectively, are maintained, whereas per-image normalisation can corrupt this information.
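The constant scaling is a one-line operation. The sketch below shows it with NumPy; the function name `scale_sample` is illustrative, and the actual input pipeline may implement the same step with TensorFlow operations.

```python
import numpy as np

def scale_sample(sample_int16):
    """Convert a rectified int16 sample to float32 and divide by 16384.

    The largest int16 value, 32767, maps to just under 2, so the output
    lies in the range 0 to 2 for rectified (non-negative) data.
    """
    return sample_int16.astype(np.float32) / 16384.0
```

Because the divisor is the same for every sample, a flaw with twice the amplitude of another still has twice the amplitude after scaling, which per-image normalisation would not guarantee.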

4.3 Machine learning model

Network

The model was constructed with the Keras framework, which operates as a high-level API on top of TensorFlow. Keras is an open-source neural network library written in Python that supplies all the neural network operations used in training. All model-related scripts, after the last binary compression, were run on CSC clusters providing a Tesla V100 GPU, which enabled swift model learning.

The constructed model resembles the architecture of VGG16, proposed by Simonyan and Zisserman (2014). The network consists of 4 sequential convolutional blocks. Each block consists of two consecutive convolutional layers, except the last block, which has a single convolutional layer, with kernel sizes of 3 × 3 and ReLU non-linear activations. These layers are followed by a max-pooling layer and, finally, a batch normalisation layer that normalises the distributions passed to the following block. This introduces more robustness to the network by reducing the internal covariate shift between the distributions of the layers and has a regularising effect (Ioffe & Szegedy, 2015). The final max-pooling layer reduces the size of the feature maps to 1 × 1, and these are then vectorised to a single dimension for the final classification layers. They were connected to a dense layer with ReLU activation, with the number of units corresponding to the number of feature maps from the last convolution. Ultimately, these units converged, via a 60 % dropout, to a single classification unit with sigmoid activation, indicating whether a flaw is present.

Figure 16 Model used to train the classifier. The model consists of three consecutive convolutional blocks followed by the final classification layers. The feature maps and filters of the first layers are also visualised, since at deeper levels the feature map representations become more abstract. It can also be seen that the max-pooling layer reduces the spatial dimension while maintaining the flaw in view.

The network architecture is presented in Figure 16, which visualises the data flow from the pooled channels to the output. Feature maps are only shown for the first layer since at deeper levels they are incomprehensible. Feature maps show the features that activate during convolution with the filters. These filters are learned during training to obtain the best representation for the task. Therefore, the convolution filters in Figure 16 should be able to capture the flaw characteristics, providing the best outcome via the activation functions. When the feature maps are max-pooled, the network gains translation invariance at the cost of location information. Therefore, similar activations are expected regardless of flaw location. The architecture yields 834 725 trainable and 904 non-trainable parameters.

The loss function was binary cross-entropy, owing to the binary task. To compute the new weights of the model after each batch, the Adam optimiser was utilised, which takes advantage of bias-corrected first moment and second raw moment estimates (Kingma & Ba, 2014).
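For reference, the loss and the update rule mentioned above can be written out in their standard textbook form. The notation is the conventional one from the literature rather than the thesis's own: y_i is the binary flaw label, ŷ_i the sigmoid output, g_t the gradient at step t, α the learning rate, and β₁, β₂, ε the Adam constants.

```latex
% Binary cross-entropy over a batch of N samples
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N}
  \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]

% Adam update (Kingma & Ba, 2014) for parameters \theta
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
  && \text{(first moment estimate)} \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^{2}
  && \text{(second raw moment estimate)} \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^{t}}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}
  && \text{(bias correction)} \\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```

The bias correction matters mainly in the first steps, when m_t and v_t are still close to their zero initialisation.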

Hyperparameters

The hyperparameters were established by trial and error. A learning rate of 0.0003 demonstrated good convergence with the Adam optimiser, without large variation in the loss and accuracy curves between adjacent epochs. A batch size of 128 was sufficient for the task. Batch sizes that are small with respect to the data set size can provide some boost in generalisation since more variety between batches influences the gradient update. The number of epochs was set to 100, but training was stopped if the model did not improve during 20 consecutive epochs, measured on the validation set. Moreover, the model was saved after an epoch based on its generalisation performance on the validation set. This ensured that only the best model would be utilised during out-of-sample testing.
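The stopping and checkpointing rule can be sketched in plain Python. The text does not state which validation metric was monitored, so validation loss is assumed here; the function name `early_stopping_trace` is illustrative, and in practice the same behaviour is typically obtained with training callbacks.

```python
def early_stopping_trace(val_losses, max_epochs=100, patience=20):
    """Return (best_epoch, last_epoch) for a sequence of validation losses.

    Training runs for at most max_epochs and stops once `patience`
    consecutive epochs pass without a new best validation loss.
    Epochs are numbered from 1.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            # A new best model: this is where a checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, last_epoch
```

For example, if the best validation loss occurs at epoch 2 and never improves afterwards, training ends after epoch 22 and the epoch 2 checkpoint is the one used for testing.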


5 Results

5.1 Phased array inspection results

The given setup for ultrasonic testing yielded high-quality scans of the flaws in the homogeneous austenitic stainless steel plates. Even the smaller flaws were detected with the system. This is due to the material homogeneity and the advanced phased array system, whose attributes are well suited to the inspection of austenitic stainless steels. The ultrasonic inspection of the austenitic stainless steel welds produced highly noisy canvases that are governed by high heterogeneity, which causes sound waves to attenuate faster.

When the flaw data was embedded into the weldments with augmentations, using the virtual flaw, the smaller flaws coalesced with the noise and became undetectable. Some of the medium-sized flaws were also difficult to detect if embedded in noisy areas of the weld. In conclusion, the flaw embedding demonstrated the behaviour expected of flaw signals in austenitic stainless steel welds.

Figure 17 shows three samples of the same flaw id (230BCB1740) with nearly equal depths. The difference in depths stems from the random factor of the data augmentation. The left and right samples show the flaw clearly within the bounding box, while the middle sample shows how high noise makes the signal nearly undetectable. This type of sample can be harmful to learning, for the same reason as described in Section 4.2 for the small flaws. The model may associate these with noise, which can result in instability during predictions. However, dropping the sizes concerned would also drop samples similar to the left one in Figure 17, which would reduce the number of unique flaws in the training set.

5.2 Machine learning results

The target of this research is to evaluate the feasibility of ML in the ultrasonic inspection of B-scans of austenitic stainless steel welds. A successful evaluation could improve the quality of inspection in this application. Both state-of-the-art PAUT and ML were utilised to reach these objectives.

A separate testing set of previously unseen ultrasonic scans was used to measure the true performance of the ML model and to observe overfitting. The testing set consisted of 1000 samples, of which approximately 50% contained a flaw and 50% did not. In this set, all flaw sizes were included to measure the true generalisation performance. The performance was evaluated based on the model's false call rate and hit/miss POD results on the testing set. As a result of data augmentation, the number of samples available for POD is very high compared to a traditional POD exercise for humans, where typically 60 samples are provided.

Figure 17 Effect of noise in B-scans. Samples from the same weld with the same flaw type and nearly equal depths. Depth is marked in the left corner of each sample.

The two unsuccessful preprocessing approaches resulted in adverse behaviour during testing. Channel merging generalised well but produced some severe misses of large individual flaws. The reason is that some of the larger flaws were concealed during merging. Feeding the network with all 31 channels simultaneously resulted in very high overfitting. This was likely due to the limited number of flaw-free canvases, which enabled the model to memorise the background. With more flaw-free canvas data, this approach may prove superior to the method used.

When all flaw sizes were included in the training, the model showed unexpectedly good POD performance. However, this was achieved with a very high false call rate of 14 %. This was because the model associated noise with small flaws, as discussed in Section 4.2. When the model was trained with flawed samples above 3.3 mm, good POD results were obtained (a90/95 = 2.1 mm) with a 2.3 % false call rate, demonstrating good generalisation performance on the highly complex and noisy data set. The POD curve is shown in Figure 18. Here, the circles at 100 % POD denote correctly classified flaws. The a90/95 value is estimated from the 90/95 lower confidence bound, indicating that flaws above this size are likely to be identified.
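The two quantities reported here start from simple counts over the test predictions. The sketch below computes the false call rate and the hit/miss records that a POD analysis takes as input; it assumes predictions have already been thresholded to 0 or 1, and the function names are illustrative. Estimating a90/95 itself requires fitting a hit/miss POD model with confidence bounds, for example following ASTM E2862 or MIL-HDBK-1823A, which is not reproduced here.

```python
def false_call_rate(labels, predictions):
    """Fraction of flaw-free samples (label 0) predicted as flawed (1)."""
    calls_on_flaw_free = [p for y, p in zip(labels, predictions) if y == 0]
    return sum(calls_on_flaw_free) / len(calls_on_flaw_free)

def hit_miss_records(labels, predictions, depths_mm):
    """Return (flaw depth, hit) pairs for flawed samples, with hit in {0, 1}."""
    return [
        (d, int(p == 1))
        for y, p, d in zip(labels, predictions, depths_mm)
        if y == 1
    ]
```

With about 500 flaw-free test samples, a false call rate of 2.3 % corresponds to roughly a dozen false calls.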

The misses in the prediction are flaws below the trained flaw sizes. However, numerous smaller flaws were correctly classified. This is probably because they have dynamical properties similar to those of the larger flaws, triggering similar activations in the model. Moreover, these samples may have been embedded in a less noisy background that would otherwise have obscured the flaw signal. Therefore, this prediction demonstrates the model's true generalisation performance. As a result, the model behaves as hoped.

Figure 18 POD results from predicting the approximately 500 flawed samples. As seen in the top left, the model is able to correctly classify smaller samples. (Siljama, Koskinen, Jessen-Juhler, & Virkkunen, 2020)


6 Discussion

The results of this research demonstrated the capacity of state-of-the-art ML models in highly complex inspection targets. The evaluation showed that an ML model extends to multi-channel data from phased array inspection. The key enabler is the rich, large-scale data sets that have become available through sophisticated data augmentation techniques, which can generate representative samples from class-imbalanced data sets (lacking flawed samples).

Moreover, the evaluation produced valuable results: no misses were observed with respect to the trained flaw sizes. Valuable results were also observed in the POD estimations, where a good a90/95 value was obtained with a tight confidence bound. This demonstrates the capability of DCNNs in analysing complex multi-channel B-scans of austenitic stainless steel welds.

These results validate the capability of DCNNs in the PAUT image classification task. The findings demonstrate that DCNNs can be applied to practical applications in the near future. Here, the inspector would remain responsible for carrying out the UT inspection and would provide the network with the scanned data. Moreover, the inspector could utilise ML to localise flawed regions in the inspection targets, and then verify them and apply expert judgement in flaw evaluation. At first, the model should be validated for the application, since it is limited to the data that it was trained with; the testing data should have high relevance to the training data. Therefore, careful monitoring of the model is initially required.

The UT practitioner benefits from ML through a reduced burden of going through large-scale, repetitive data sets consisting mostly of flaw-free samples. With state-of-the-art ML, the data can be analysed with homogeneous inspection quality, which reduces the human errors caused by the nature of the task, stress, or personal matters. As a result, assessment of the structural integrity of austenitic stainless steel welds is facilitated by the increased reliability of inspection in critical components of NPPs.

During flaw embedding and preprocessing, a few errors were observed. Initially, some of the larger flaws, i.e. 3.5 mm deep, were undetectable in some B-scans. This was due to cropping beyond the weld canvas, which dampened the flaw. This issue was reduced by limiting the clipping area. Moreover, an unsuccessful preprocessing method was applied to reduce the size of the input data. Usually, when the channels were merged, the flaw was visible, but occasionally the flaw was concealed. This resulted in learning instabilities, as discussed earlier. Therefore, merging the channels was omitted.

The present study showed that the selection of flaw sizes had a significant impact on ML performance. Concealed flaw signals could result in excessive false call rates. It should therefore be ensured that the samples supplied to the model contain the flaw signal. For this reason, the smaller flaws were omitted in this research.

The false call performance in this research could readily be improved by obtaining more flaw-free data. Flaw-free data is more readily available than flawed data, so acquiring more of it should not be difficult. The tested preprocessing method that includes all the channels in training could also benefit from an increased number of flaw-free canvases, since its potential was overshadowed by its tendency to overfit on flaw-free samples. Therefore, it would be of interest to apply multi-channel data directly to the model to enable the combining of channel information in training.

Earlier studies have shown the potential of neural networks in UT. Munir et al. (2019) applied a shallower convolutional network to A-scans for the classification of different flaw types. The available data set consisted of only 760 samples, which were augmented by shifting the signal, increasing the data set to 3600. With the given data set, the network's classification capabilities were respectable. To increase the performance further, more samples should be obtained. In this study, the data augmentation generated 500 000 samples utilising five weld canvases and 80 flaw signals. Therefore, this study did not suffer from a lack of representative samples, but from a lack of flaw-free canvas data.

It would be of future interest to apply object detection neural networks to UT tasks. In these networks, the model is given additional supervision in the form of flaw coordinates during training. During testing, the model would then predict the location of the flaw, if present. This means that the model architecture would need to maintain spatial information and would thus be more complex, with multiple branches and loss functions. Features from Faster R-CNN and YOLOv3, such as ROIs or bounding boxes applied in training, could be implemented.


7 Limitations

This research had a few limitations regarding the number of flaws and welds available. Trueflaw supplied 16 unique flaws, which is a significant improvement on the earlier study by Virkkunen et al. (2019), which had three unique cracks. The 16 cracks were each scanned from 5 different locations, providing 80 flaw signals in total for further augmentation. However, there were only three weld samples, and one of them had usable data from only one side of the weld due to an inconsistency in the inspection setup. This was restrictive, since the natural variation between different welds is not well captured, which results in misclassification of flaw-free canvases during model prediction. Additionally, it would be of interest to evaluate the ML model on true austenitic stainless steel samples with natural flaws.


8 Conclusion

This research studied the feasibility of machine learning in the analysis of multichannel B-scans of austenitic stainless steel welds. The aim was to improve the reliability of inspection of critical components with very complex structural properties, in order to verify structural integrity. The study was conducted by applying sophisticated data augmentation techniques to enrich the data set, enabling the use of deep convolutional neural networks (DCNN) with high precision without relying on a small data set. Moreover, ultrasonic data is usually very class imbalanced, with flawed samples being rare. Data augmentation yielded 500 000 samples for each weld, which were utilised for training, validation and testing. After distributing the data to each set, the data was preprocessed and each channel was treated individually, even though the data was multi-channel. From the training and validation sets, the smaller flaws were omitted to prevent the model from associating flaws with noise signals. Subsequently, a DCNN with four convolutional blocks followed by two dense layers, with dropout, for classification was trained and tested.

The results demonstrated good POD performance (a90/95 = 2.1 mm) with a tight confidence bound and a false call rate of 2.3 % in the classification of 1000 B-scan samples. These results are considered good when compared with the data set size typically used in POD evaluation (~60 samples). These results verify the potential of ML models in phased array ultrasonic inspection of highly complex structures.
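
As an illustration of how an a90 value and a false call rate relate to classification results, the following minimal sketch assumes a logistic (logit-link) hit/miss POD model, which is one common choice and not necessarily the exact procedure used in this work. The coefficients b0 and b1 and the counts are hypothetical placeholders, not values from this study. Only the point estimate a90 is computed; the 95 % confidence bound needed for a90/95 requires an additional interval estimation step that is not shown.

```python
import math


def pod_logistic(a, b0, b1):
    """Hit/miss POD model with a logit link: POD(a) = 1 / (1 + exp(-(b0 + b1 * a)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * a)))


def flaw_size_at_pod(p, b0, b1):
    """Invert the logistic model to get the flaw size a where POD(a) equals p."""
    return (math.log(p / (1.0 - p)) - b0) / b1


def false_call_rate(false_calls, flaw_free_samples):
    """Fraction of flaw-free samples that were classified as flawed."""
    return false_calls / flaw_free_samples


# Hypothetical fitted coefficients (assumption, for illustration only).
b0, b1 = -4.0, 3.5

# Point estimate of the flaw size detected with 90 % probability.
a90 = flaw_size_at_pod(0.90, b0, b1)
print(f"a90 point estimate: {a90:.2f} mm")
print(f"POD at a90: {pod_logistic(a90, b0, b1):.2f}")

# Hypothetical counts: 5 false calls among 200 flaw-free samples.
print(f"False call rate: {100 * false_call_rate(5, 200):.1f} %")
```

In this sketch, a larger slope b1 gives a steeper POD curve, so a90 moves closer to the flaw size at which detection becomes likely at all.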

Acknowledgements

Suisto Engineering contributed the austenitic stainless steel welds. DEKRA contributed the phased array ultrasonic inspection. Trueflaw contributed to the data augmentation. SAFIR2022 contributed financial support. Their support is gratefully acknowledged.


References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., . . . Zheng, X. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. Retrieved from https://arxiv.org/abs/1603.04467

Ali, A. H., Balint, D., Temple, A., & Leevers, P. (2012). The reliability of defect sentencing in manual ultrasonic inspection. NDT & E International, 51, 101-110. DOI:10.1016/j.ndteint.2012.04.003

Annis, C. (2009). MIL-HDBK-1823A, nondestructive evaluation system reliability assessment. Technical report.

ASTM. (2015). E3023 standard practice for probability of detection analysis for â versus a data. American Society for Testing and Materials. DOI:10.1520/E3023-15

ASTM. (2018). E2862 standard practice for probability of detection analysis for hit/miss data. American Society for Testing and Materials. DOI:10.1520/E2862-18

Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2018). Reconciling modern machine learning practice and the bias-variance trade-off. Retrieved from https://arxiv.org/abs/1812.11118

Bhaduri, A., & Laha, K. (2015). Development of improved materials for structural components of sodium-cooled fast reactors. Procedia Engineering, 130, 598-608. DOI:10.1016/j.proeng.2015.12.276

Chassignole, B., El Guerjouma, R., Ploix, M., & Fouquet, T. (2010). Ultrasonic and structural characterization of anisotropic austenitic stainless steel welds: Towards a higher reliability in ultrasonic non-destructive testing. NDT & E International, 43(4), 273-282. DOI:10.1016/j.ndteint.2009.12.005

Chollet, F. (2018). Deep learning with python (1st ed.). Shelter Island, NY 11964: Manning Publications Co.

Chollet, F., et al. (2015). Keras. https://Keras.io.

Cruz, F., Simas Filho, E., Albuquerque, M., Silva, I., Farias, C., & Gouvêa, L. (2017). Efficient feature selection for neural network based detection of flaws in steel welded joints using ultrasound testing. Ultrasonics, 73, 1-8. DOI:10.1016/j.ultras.2016.08.017

Dorafshan, S., Thomas, R., & Maguire, M. (2018). Comparison of deep convolutional neural networks and edge detectors for image-based crack detection in concrete. Construction and Building Materials, 186, 1031-1045. DOI:10.1016/j.conbuildmat.2018.08.011

Finnish Standards Association, S. (2017). SFS-EN 22825:2017. Non-destructive testing of welds – ultrasonic testing – testing of welds in austenitic steels and nickel-based alloys.


Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188. DOI:10.1111/j.1469-1809.1936.tb02137.x

Garbin, C., Zhu, X., & Marques, O. (2020). Dropout vs. batch normalization: An empirical study of their impact to deep learning. Multimedia Tools and Applications. DOI:10.1007/s11042-019-08453-9

Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. 2014 IEEE Conference on Computer Vision and Pattern Recognition, 580-587. DOI:10.1109/CVPR.2014.81

Gommlich, A., & Schubert, F. On determination of focal laws for linear phased array probes as to the active and passive element size. Paper presented at the 19th World Conference on Non-Destructive Testing 2016. Retrieved from https://www.ndt.net/article/wcndt2016/papers/p139.pdf

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, Massachusetts; London, England: The MIT Press.

Hinton, G. (2012). Neural networks for machine learning online course. Retrieved from https://www.coursera.org/learn/neural-networks/home/welcome

Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. Retrieved from https://arxiv.org/abs/1207.0580

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Retrieved from https://arxiv.org/abs/1502.03167

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. Retrieved from https://arxiv.org/abs/1412.6980

Kolkoori, S., Rahman, M., & Prager, J. (2012). Effect of columnar grain orientation on ultrasonic plane wave energy reflection and transmission behaviour in anisotropic austenitic weld materials. Journal of Nondestructive Evaluation, 31(3), 253-269. DOI:10.1007/s10921-012-0140-1

Krautkrämer, J., & Krautkrämer, H. (1977). Ultrasonic testing of materials. Berlin: Springer.

Kumar, S., Menaka, M., & Venkatraman, B. (2019). Simulation and experimental analysis of austenitic stainless steel weld joints using ultrasonic phased array. Measurement Science and Technology, 31(2), 024005. DOI:10.1088/1361-6501/ab48a3

Kurz, J., Jüngert, A., Dugan, S., Dobmann, G., & Boller, C. (2013). Reliability considerations of NDT by probability of detection (POD) determination using ultrasound phased array. Engineering Failure Analysis, 35, 609-617. DOI:10.1016/j.engfailanal.2013.06.008


Mishkin, D., Sergievskiy, N., & Matas, J. (2017). Systematic evaluation of convolution neural network advances on the imagenet. Computer Vision and Image Understanding, 161, 11-19. DOI:10.1016/j.cviu.2017.05.007

Munir, N., Kim, H., Park, J., Song, S., & Kang, S. (2019). Convolutional neural network for ultrasonic weldment flaw classification in noisy conditions. Ultrasonics, 94, 74-81. DOI:10.1016/j.ultras.2018.12.001

Munir, N., Kim, H., Song, S., & Kang, S. (2018). Investigation of deep neural network with drop out for ultrasonic flaw classification in weldments. Journal of Mechanical Science and Technology, 32(7), 3073-3080. DOI:10.1007/s12206-018-0610-1

Ploix, M., Guy, P., Chassignole, B., Moysan, J., Corneloup, G., & Guerjouma, R. E. (2014). Measurement of ultrasonic scattering attenuation in austenitic stainless steel welds: Realistic input data for NDT numerical modeling. Ultrasonics, 54(7), 1729-1736. DOI:10.1016/j.ultras.2014.04.005

Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779-788. DOI:10.1109/CVPR.2016.91

Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. Retrieved from https://arxiv.org/abs/1804.02767

Satyanarayan, L., Kumar, A., Jayakumar, T., Krishnamurthy, C., Balasubramaniam, K., & Raj, B. (2009). Sizing cracks in power plant components using array based ultrasonic techniques. Journal of Nondestructive Evaluation, 28(3), 111-124. DOI:10.1007/s10921-009-0053-9

Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137-1149. DOI:10.1109/TPAMI.2016.2577031

Siljama, O., Koskinen, T., Jessen-Juhler, O., & Virkkunen, I. (2020). Automated flaw detection in multi-channel phased array ultrasonic data using machine learning. Journal of Nondestructive Evaluation.

Song, S., & Kim, H. (2000). An intelligent system approach to real-time ultrasonic flaw classification in weldments. JSME International Journal Series C Mechanical Systems, Machine Elements and Manufacturing, 43(1), 60-72. DOI:10.1299/jsmec.43.60

Subair, S. M., Agrawal, S., Balasubramaniam, K., Rajagopal, P., Kumar, A., Rao, P. B., & Tamanna, J. (2015). On a framework for generating PoD curves assisted by numerical simulations. AIP Conference Proceedings, 1650(1). DOI:10.1063/1.4914817

Tremblay, P., & Richard, D. (2012). Development and validation of a full matrix capture solution. Retrieved from https://www.ndt.net/article/jrc-nde2013/papers/220.pdf


Virkkunen, I., Bolander, M., Myöhänen, H., & Miorelli, R. (2019). Qualification of an AI / ML NDT system – technical basis. Technical report. European Network for Inspection & Qualification.

Virkkunen, I., Koskinen, T., Papula, S., Sarikka, T., & Hänninen, H. (2019). Comparison of â versus a and hit/miss POD-estimation methods: A European viewpoint. Journal of Nondestructive Evaluation, 38(4), 89. DOI:10.1007/s10921-019-0628-z

Workman, G. L., & Kishoni, D. (2007). Nondestructive testing handbook; vol. 7: Ultrasonic testing (Third ed.). American Society for Nondestructive Testing.