Defect Detection and Quantification Toolbox (DDQT)
The domains of Nondestructive Testing (NDT) and Structural Health Monitoring (SHM) comprise techniques used to evaluate the state of health of a structure without damaging it. Typical structures being inspected include aircraft components, bridges, and nuclear reactors. The Defect Detection and Quantification Toolbox (DDQT) aims to provide a framework that automates the routine tasks faced by researchers in NDT and SHM.
In NDT and SHM, experiments are performed using a variety of modalities such as ultrasound, infrared thermography, and X-rays, and Matlab is often used in performing these experiments. The resulting data is typically a 3D dataset comprising a time axis and two spatial axes. Once the data is obtained, a variety of visualization checks are performed. Next, features are created from the time series at every spatial location; examples include time-series, gradient, and spatial-filter features. To avoid the curse of dimensionality, a feature reduction step is then performed using PCA in the case of unsupervised defect detection. Once a subset of meaningful features is identified, the defect regions are detected using a statistical distance metric such as the Mahalanobis distance. In DDQT, we also propose a relatively newer method that identifies defects using Isolation Forests. Finally, the performance of the feature space and detection algorithms is quantified, relative to the raw data, using Receiver Operating Characteristic (ROC) curves and commonly used classification metrics such as the True Positive Rate (TPR), False Positive Rate (FPR) and Area Under the Curve (AUC). A visual representation of these metrics is presented using charts to aid the user.
While researchers in NDT and SHM often follow these clearly defined steps, to the best of our knowledge no existing offering provides a framework that lets them focus mainly on the feature space and on tweaking the defect detection algorithms. With minimal edits to the driver program, researchers can visualize the results with ease and concentrate on the underlying physics to improve defect detection, instead of spending time and effort setting up the pipeline.
In summary, `DDQT` can be used for:
Reading in Matlab data
Visualizing data
Creating features in the time and spatial domain
Feature reduction using PCA
Identifying defects using Mahalanobis distance and Isolation Forest
Quantifying results using ROC curves
Visualizing outcomes
There are numerous avenues to enhance this toolbox, and I welcome any contributions. Some possible areas for improvement are:
Improvements in feature space
Improvements to defect detection algorithms
Coding enhancements
Documentation enhancements
Currently, only certain time stamps are used when calculating computationally intensive features. There is scope to write more computationally efficient code that handles more time stamps (if not everything…)
Possibility of including circular defects - currently, defects are defined using polygon vertices
If you would like to collaborate with me in improving this toolbox or if you would like to provide sample data, please reach out to me at
>>> my_first_name = 'arun'
>>> print(str(my_first_name) + 'mano121@outlook.com')
Feel free to fork and add any enhancements, and let me know if a pull request is needed to merge the changes.
Getting Started
Dependencies
In order to run the program, you need Python 3 plus the packages the code relies on; judging from the pipeline, these include NumPy, SciPy (file IO), Matplotlib (plotting), scikit-learn (PCA and Isolation Forest) and a wavelet package such as PyWavelets, all installable via pip.
Installing
Either use git clone with the following command:
git clone https://github.com/arunmano121/DDQT.git MyDDQT
or manually download the two Python files into your desired working directory. Here, MyDDQT is just an example; you can use any name of your choice.
Running the program
cd into your working directory and run the program.
cd MyDDQT
./DefectDetection.py
Configuring Parameters
The program is designed so that all parameter settings only need to be edited within the main() module.
The following block is used to load the Matlab data; it assumes a dataset named sample.mat containing a table named rawData1. The output data is stored as an ndarray named mat, which will be used for further processing.
print('Reading in raw matlab data using scipy IO modules...')
# Example - this assumes a matlab dataset named sample.mat and the
# table named rawData1 inside the dataset
dataset = 'sample.mat'
tablename = 'rawData1'
mat = read_matlab_data(dataset=dataset, table=tablename)
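For reference, read_matlab_data uses scipy's IO module under the hood; a minimal sketch of the equivalent direct call (assuming the same dataset and table names as above) is:
from scipy import io

# loadmat returns a dict keyed by the variable names stored in the .mat file
mat = io.loadmat('sample.mat')['rawData1']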
The data is assumed to be in the format time (axis=0), spatial axis 1 (axis=1) and spatial axis 2 (axis=2). If the Matlab dataset contains data in a different axis order, re-arrange it using numpy.moveaxis before proceeding to the subsequent steps.
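As an illustration, a minimal sketch for the hypothetical case where the data arrives as (s1, s2, t) with time last:
import numpy as np

# move the last axis (time) to the front: (s1, s2, t) -> (t, s1, s2)
mat = np.moveaxis(mat, -1, 0)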
Data is described by the following block.
[t_max, s1_max, s2_max] = mat.shape
print('Shape of the data matrix')
print('t_max: %d s1_max: %d s2_max: %d' % (t_max, s1_max, s2_max))
The ranges of the three axes are then set.
# scanning parameters
# in this sample code, the time axis ranges from t_lb to t_ub over t_max steps
t_lb = 0*1e-1
t_ub = t_max*1e-1
t = np.linspace(t_lb, t_ub, t_max)
# s1 axis ranges from s1_lb to s1_ub divided over s1_max steps
s1_lb = 0
s1_ub = 200
s1 = np.linspace(s1_lb, s1_ub, s1_max)
# s2 axis ranges from s2_lb to s2_ub divided over s2_max steps
s2_lb = 0
s2_ub = 250
s2 = np.linspace(s2_lb, s2_ub, s2_max)
The units along the three axes are held in a dictionary named units. In this example, the time axis is defined in microseconds, while the two spatial axes are in mm. Set the appropriate units based on the experiment.
# dictionary object to hold the units along the different axis
units = {'t_units': '$\\mu$S', 's1_units': 'mm', 's2_units': 'mm'}
For ease of plotting, the s1 and s2 axes are converted to 2D meshgrids.
# meshgrid conversion in 2D
s1_2d, s2_2d = np.meshgrid(s1, s2, indexing='ij')
Raw data is visualized at four random spatial points by charting the time series.
# raw data visualization
print('Pick 4 random spatial coordinates and chart the time-series...')
visualize_time_series(mat, t, s1, s2, units)
The spatial data is visualized at different time stamps as needed. In the example below, the spatial data is visualized between time indices of 450 (t_min_idx) and 500 (t_max_idx) in steps of 25 (del_t_idx).
print('Visualize spatial slices of data at certain time stamps...')
t_min_idx = 450
t_max_idx = 500
del_t_idx = 25
visualize_spatial_data(mat, t, s1_2d, s2_2d,
t_min_idx, t_max_idx, del_t_idx, units)
The raw time series is very noisy, and a low-pass filter is often desired. In this example, the time series is filtered using a simple mean filter, which averages over size samples; the bigger the number, the more aggressive the filtering.
# time series filtering of data
print('performing mean filtering at each spatial location...')
mat = mean_filter(mat, t, s1, s2, units, size=20, plot_sample=True)
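Conceptually, the filter behaves like a moving average at each spatial location; a minimal sketch of the idea on a single time series (not the toolbox's exact implementation) is:
import numpy as np

# series is a hypothetical 1D time series at one (s1, s2) location;
# averaging with a 20-sample window smooths out high-frequency noise
kernel = np.ones(20) / 20
filtered = np.convolve(series, kernel, mode='same')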
The defects are defined using a list structure. As many defects as needed can be set up, and each defect can be defined using as many vertices as needed. Each defect is a list of vertex tuples, and the defect names (labels) are a list of strings.
# define defects
print('Defining coordinates of defects...')
# define as many defects as needed
# each defect should contain the coordinates of the vertices
# the structure is list of tuples
def1 = [(20, 20), (50, 10), (30, 40), (20, 30)]
def2 = [(120, 120), (180, 120), (150, 180)]
def3 = [(60, 60), (80, 60), (80, 80), (60, 80)]
# list contains all the defects
defs_coord = [def1, def2, def3]
def_names = ['D1', 'D2', 'D3'] # names of defects
defs = define_defects(s1, s2, defs_coord, def_names)
Calculating features at every time index is computationally intensive, so a sample of time stamps is defined. t_stamps defines the indices at which features are calculated and where performance is finally measured.
# sample time indices where computationally intensive features
# will be calculated.
t_stamps = range(500, 800, 100)
Feature engineering is very important and depends on the problem at hand and the creativity of the researcher. Feel free to define additional features as necessary. In this sample, the following families of features are calculated.
Identity features.
# identity features
features_id = {}
features_id['id'] = mat
Gradient-based features.
# compute gradient features
print('Calculating spatial and temporal gradients...')
features_grad = {}
features_grad = compute_features_grad(mat)
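As a sketch, such gradients can be obtained with numpy's gradient over the data cube (compute_features_grad may differ in its details):
import numpy as np

# gradients of the 3D cube along time (axis 0) and the two spatial axes
t_grad, s1_grad, s2_grad = np.gradient(mat)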
Spatial domain features are calculated at the desired time indices defined above.
# compute spatial domain features
print('Calculating spatial features at every location and time...')
features_sd = {}
features_sd = compute_features_sd(mat, t_stamps)
Time domain features are calculated at the desired time indices defined above.
# compute time domain features
print('Calculating temporal features at every spatial location...')
features_td = {}
features_td = compute_features_td(mat, t_stamps)
Wavelet decomposition features are calculated at the desired time indices defined above.
# compute wavelet decomposition features
print('Calculating wavelet transformed features at every location...')
features_wav = {}
features_wav = compute_features_wav(mat, t_stamps)
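As a rough sketch, a wavelet decomposition of the time series at one location might look like the following; the wavelet family and decomposition level here are assumptions, not necessarily what compute_features_wav uses:
import pywt

# multi-level discrete wavelet decomposition of a hypothetical 1D series
coeffs = pywt.wavedec(series, 'db4', level=3)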
Once the features are calculated, it is often desirable to visualize them. The visualize_features function accomplishes this, as shown below. In the example, the s1_grad and s2_grad features belonging to features_grad are visualized.
# visualize feature
print('Visualizing computed features...')
t_idx = 650
visualize_features(mat, features_grad, s1_2d, s2_2d, 's1_grad',
t_idx, t, units)
visualize_features(mat, features_grad, s1_2d, s2_2d, 's2_grad',
t_idx, t, units)
The input features across all families are now combined into a single feature dictionary for further processing. The combine_features function combines the families of features listed in feature_list.
# combine features
print('Combining all features from different methods into a dict...')
feature_list = [features_id, features_grad, features_sd,
features_td, features_wav]
features = {}
features = combine_features(feature_list)
print('Total number of features is %d' % (len(features)))
The features are scaled using their minimum and maximum values, so that the resulting features lie between 0 and 1. Scaling features has proven to be useful in machine learning.
# normalize features
print('Normalize features...')
features = normalize_features(features, t_stamps)
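A minimal sketch of min-max scaling on a single feature array f (assumed to mirror what normalize_features does per feature):
# map the minimum of f to 0 and its maximum to 1
f_scaled = (f - f.min()) / (f.max() - f.min())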
Outlier analysis is performed using two methods: Mahalanobis distance and Isolation Forest. If PCA is desired to reduce the input dimensionality, set pca_var to the desired explained variance level. For example, if pca_var is set to 0.9, 90% explained variance is desired, and PCA will choose the number of dimensions needed to achieve this. The result of the Mahalanobis distance is output to the ndarray named mah.
# Outlier analysis using Mahalanobis distance
# if PCA is required to trim features, set pca_var to the desired
# explained variance level - in this example, 90% variance is desired
print('Mahalanobis distance to identify outliers...')
mah = {}
mah = outlier_mah(features, t_stamps, pca_var=0.9)
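As an illustration of the two steps (a sketch of the underlying math, not the toolbox's exact code), assume a 2D feature matrix X of shape (n_samples, n_features):
import numpy as np
from sklearn.decomposition import PCA

# keep as many principal components as needed to explain 90% of the variance
X_red = PCA(n_components=0.9).fit_transform(X)
# Mahalanobis distance of every sample from the mean of the reduced features
mu = X_red.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(X_red, rowvar=False))
diff = X_red - mu
mah_dist = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))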
Another popular method for detecting outliers is the Isolation Forest. The result is output to the ndarray named iso.
# fit Isolation Forest model
# if PCA is required to trim features, set pca_var to the desired
# explained variance level - in this example, 90% variance is desired
print('Fit Isolation Forest model...')
iso = {}
iso = fit_isolationforest_model(features, t_stamps, pca_var=0.9)
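This is the kind of fit that scikit-learn's IsolationForest performs; a minimal sketch on an assumed 2D feature matrix X:
from sklearn.ensemble import IsolationForest

# score_samples is higher for inliers, so negate it to obtain an anomaly
# score where larger values indicate more likely defects
clf = IsolationForest(random_state=0).fit(X)
anomaly_score = -clf.score_samples(X)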
In order to better visualize the results contained in mah and iso, the frames are scaled between 0 and 1 using the minimum and maximum values of the arrays.
# scale frames between 0-1
print('Scaling frames between 0-1 for better interpretability...')
mat = scale_frames(mat, t_stamps)
mah = scale_frames(mah, t_stamps)
iso = scale_frames(iso, t_stamps)
defect_detection_metrics computes the performance of the algorithms using the True Positive Rate (TPR), False Positive Rate (FPR) and Area Under the Curve (AUC) metrics. The function also outputs the TPR at FPR values of 2%, 5% and 10%. If the plot parameter is set to True, Receiver Operating Characteristic (ROC) curves are plotted to show the improvement obtained over the raw data.
# Defect detection metrics
print('Quantification of defect detection and plotting the results...')
defect_detection_metrics(mat, mah, iso, s1_2d, s2_2d,
defs, t_stamps, t, units, plot=True)
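For reference, the same quantities can be computed with scikit-learn, given ground-truth defect labels y_true and a detection score y_score per spatial location (these names are for illustration only):
from sklearn.metrics import roc_curve, auc

# fpr/tpr trace out the ROC curve; auc summarizes it as a single number
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)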
How to cite
If you use this work in your research, please cite using:
@software{ArunManohar_20210322,
author = {Arun Manohar},
title = {{Defect Detection and Quantification Toolbox (DDQT)}},
month = mar,
year = 2021,
publisher = {Zenodo},
version = {v0.1.0},
doi = {10.5281/zenodo.4627984},
url = {https://doi.org/10.5281/zenodo.4627984}
}
Thank you!
Code Reference
DefectDetection module
- DefectDetection.annotate_plots(ax, defs)
Annotate charts with locations of defects
- Parameters
- ax: axis object
plot axis
- defs: dict
defect parameters
- Returns
- None
- DefectDetection.combine_features(feature_list)
Combine all features from different methods into one single dict
- Parameters
- feature_list: list
list containing all entries of input features that need to be concatenated
- Returns
- features: dict
feature dictionary containing all the input features
- DefectDetection.compute_features_grad(mat)
Calculates spatial and temporal gradients
- Parameters
- mat: ndarray
raw data
- Returns
- features_grad: dict
dictionary containing spatial and temporal gradient features
- DefectDetection.compute_features_sd(mat, t_stamps)
Calculates spatial features at every location and time stamp
- Parameters
- mat: ndarray
raw data
- t_stamps: list
time stamps at which time domain features are calculated
- Returns
- features_sd: dict
dictionary containing spatial domain features
- DefectDetection.compute_features_td(mat, t_stamps)
Calculating temporal features at every spatial location
- Parameters
- mat: ndarray
raw data
- t_stamps: list
time stamps at which time domain features are calculated
- Returns
- features_td: dict
dictionary containing time domain features
- DefectDetection.compute_features_wav(mat, t_stamps)
Calculates wavelet transformed features at every location
- Parameters
- mat: ndarray
raw data
- t_stamps: list
time stamps at which wavelet features are calculated
- Returns
- features_wav: dict
dictionary containing wavelet features
- DefectDetection.defect_detection_metrics(mat, mah, iso, s1_2d, s2_2d, defs, t_stamps, t, units, plot=True)
Quantification of defect detection, and plotting the results
The True Positive Rate (TPR), False Positive Rate (FPR) and Receiver Operating Characteristic (ROC) curves are calculated for the raw data, the Mahalanobis distance result and the Isolation Forest result. In addition, the Area Under the Curve (AUC) is calculated to quantify performance. Often, a higher TPR at a lower FPR is desired; to aid this, TPR values are calculated at 2%, 5% and 10% FPR. Further, the results are presented graphically if needed.
- Parameters
- mat: ndarray
raw data - 3D float array
- mah: ndarray
result of performing Mahalanobis distance - 3D float array
- iso: ndarray
result of performing Isolation Forest algorithm - 3D float array
- s1_2d: ndarray
2D meshgrid representation of s1 axis
- s2_2d: ndarray
2D meshgrid representation of s2 axis
- defs: dict
defect parameters
- t_stamps: list
time stamps at which features were calculated and where results are desired
- t: list
time coordinates
- units: dict
units of the different dimensions
- plot: Bool
Boolean to indicate if plots are needed to visualize
- Returns
- None
- DefectDetection.define_defects(s1, s2, defs_coord, def_names)
Define coordinates of defects
- Parameters
- s1: list
spatial axis 1
- s2: list
spatial axis 2
- defs_coord: list
list containing all defects - each defect contains a list of tuples containing the vertices of defect
- def_names: list
list containing the names of the defects
- Returns
- defs: dict
dictionary containing all the necessary parameters of all the defined defects
- DefectDetection.fit_isolationforest_model(features, t_stamps, pca_var)
Fit Isolation Forest model
- Parameters
- features: dict
dictionary containing all input features
- t_stamps: list
time stamps at which features were calculated and where results are desired
- pca_var: float
contains the desired explained variance parameter, if less than 1.0, PCA will be performed
- Returns
- iso: ndarray
result of Isolation Forest model over the data
- DefectDetection.main()
All the subroutines are called from here
- DefectDetection.mean_filter(mat, t, s1, s2, units, size, plot_sample)
Performs mean filtering at each location
- Parameters
- mat: ndarray
raw data
- t: list
time axis
- s1: list
spatial axis 1
- s2: list
spatial axis 2
- units: dict
units of the different dimensions
- size: int
number of elements to use in the mean filter. The higher, the more aggressive the filtering
- plot_sample: Bool
Boolean to indicate if time series plots are needed to compare raw and filtered data
- Returns
- filt_mat: ndarray
mean filtered raw data based on kernel size
- DefectDetection.normalize_features(features, t_stamps)
Normalize features
- Parameters
- features: dict
dictionary containing all input features
- t_stamps: list
time stamps at which features were calculated and where results are desired
- Returns
- features: dict
dictionary containing all normalized features
- DefectDetection.outlier_mah(features, t_stamps, pca_var)
Mahalanobis distance to identify outliers
- Parameters
- features: dict
dictionary containing all input features
- t_stamps: list
time stamps at which features were calculated and where results are desired
- pca_var: float
contains the desired explained variance parameter, if less than 1.0, PCA will be performed
- Returns
- mah: ndarray
contains the result of computing Mahalanobis distance over the data
- DefectDetection.read_matlab_data(dataset, table)
Reads in raw matlab data using scipy IO modules
- Parameters
- dataset: str
name of the Matlab dataset file
- table: str
name of the table within the Matlab dataset
- Returns
- mat: ndarray
matlab data that has been converted to numpy array
- DefectDetection.scale_frames(arr, t_stamps)
Scale frames between 0-1 for better interpretability
- Parameters
- arr: ndarray
input array that needs to be scaled
- t_stamps: list
time stamps at which features were calculated and where results are desired
- Returns
- outarr: ndarray
scaled array where the elements lie between 0-1
- DefectDetection.visualize_features(mat, features, s1_2d, s2_2d, feature, t_idx, t, units)
Visualize computed features
- Parameters
- mat: ndarray
raw data
- features: dict
dictionary containing input features
- s1_2d: ndarray
2D meshgrid representation of s1 axis
- s2_2d: ndarray
2D meshgrid representation of s2 axis
- feature: str
desired feature that needs to be visualized
- t_idx: int
time index at which visualization is needed
- units: dict
units of the different dimensions
- Returns
- None
- DefectDetection.visualize_spatial_data(mat, t, s1_2d, s2_2d, t_min_idx, t_max_idx, del_t_idx, units)
Visualize spatial slices of data at certain time stamps
- Parameters
- mat: ndarray
raw data
- t: list
time axis
- s1_2d: ndarray
2D meshgrid representation of s1 axis
- s2_2d: ndarray
2D meshgrid representation of s2 axis
- t_min_idx: int
lower bound time index for visualization
- t_max_idx: int
upper bound time index for visualization
- del_t_idx: int
time index steps for visualization
- units: dict
units of the different dimensions
- Returns
- None
- DefectDetection.visualize_time_series(mat, t, s1, s2, units)
Pick 4 random spatial coordinates and chart the time-series
- Parameters
- mat: ndarray
raw data
- t: list
time axis
- s1: list
spatial axis 1
- s2: list
spatial axis 2
- units: dict
units of the different dimensions
- Returns
- None
point_in_convex_polygon module
Helper module to determine if a point lies within a polygon
Script is based on Ref1 and Ref2.
- class point_in_convex_polygon.Point(s1, s2)
Bases: object
Point class to define a point
- point_in_convex_polygon.is_within_polygon(polygon, point)
Determine if a point lies within the polygon
- Parameters
- polygon: list of points
polygon definition using a set of points
- point: Point
a single point
- Returns
- True/False: Bool
depending on whether the point lies within the polygon
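As a rough illustration of the idea (a sketch assuming a convex polygon with vertices in a consistent winding order and plain (s1, s2) tuples, not the module's exact code):
def is_within_convex_polygon(polygon, point):
    # a point is inside a convex polygon if it lies on the same side of
    # every edge, tested via the sign of the 2D cross product
    sign = 0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        cross = (x2 - x1) * (point[1] - y1) - (y2 - y1) * (point[0] - x1)
        if cross != 0:
            if sign == 0:
                sign = 1 if cross > 0 else -1
            elif (cross > 0) != (sign > 0):
                return False
    return True

is_within_convex_polygon([(60, 60), (80, 60), (80, 80), (60, 80)], (70, 70))  # True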
License
BSD 3-Clause License
Copyright (c) 2021, Arun Manohar. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Acknowledgements
Thanks to Prof. Michael Todd, Dr. See Yenn Chong and Mr. Zihan Wu at University of California, San Diego, for valuable discussions and inputs.
Contact
Arun Manohar
>>> my_first_name = 'arun'
>>> print(str(my_first_name) + 'mano121@outlook.com')