AGUILA: An Automatic Tube Detection System

This paper discusses a system which uses machine vision algorithms for the detection of tubes in an extremely "noisy" industrial environment. The heart of the algorithm consists of sampling the image in predefined sparse bands perpendicular to the likely orientation of the tubes, followed by the application of a normalized correlation as a detection filter over these bands. After a reliability factor is assigned to each local maximum of the correlation function, these points are mapped to the Hough space to determine the equations of the midlines of the tubes. Given these equations, the number of tubes and any positional anomalies are reported.


Introduction
A number of recent machine vision systems for industrial applications have been reported in the literature [1,2]. Some of these applications require robust solutions because sensing is done in harsh industrial environments. Examples include the detection and classification of defects in metal strips [3], the inspection of aluminum castings [4], and ash line control in sloping grate bark furnaces [5]. In the present paper, a part detection problem of an industrial plant is studied, and a prototype system is described.
One of the concerns of a large steel-tube fabrication plant is the counting of the tubes in the production line, fig. 1, and the detection of any positional anomalies associated with them, fig. 2. The tubes have diameters varying from approximately 20 mm to 200 mm and are transported by a conveyor which has rotating cylindrical wheels. In the following discussion, background image, or simply the background, refers to the conveyor frame, the wheels and any other foreign object which can be seen from the top view when the conveyor is empty. These objects can be anything lying on the ground, from shiny metallic pieces to tubes thrown on the floor on a heavy production day. A reliable detection system must distinguish between these artifacts and the tubes lying on top of the conveyor. The former are treated as noise, and the latter as the objects of interest. Moreover, the environment is exceptionally polluted: vibrations and misty air make the capture of clear and focused images impossible. Also, serious contamination such as dirt, scratches caused by the collisions of the tubes, and normal deformations due to the manufacturing process make the reflection properties of the tubes quite irregular. In spite of that, the system is required to perform with a reliability rate of 99.8%. Just after the arrival of the tubes, there is a period of time (about 5 seconds) during which the tubes are stationary. In this period, the system is required to capture an image, count the number of tubes present, detect any positional anomaly and alert the operator. In order to achieve this goal, the position and the orientation of the tubes must be detected. These stringent requirements call for very robust and efficient algorithms.
The organization of the paper is as follows. Section 2 studies the detection problem. Section 3 shows the method used to sample the image in sparse bands perpendicular to the likely orientation of the tubes, and the filtering operation on the samples using the filter chosen in Section 2. Section 4 describes the method by which the locations where the presence of a tube is most likely are mapped to the Hough space. The Hough method is used for increased robustness in the detection of the position and the orientation of the tubes. A description of the positional anomalies of the tubes is then generated. Also, a detailed description of how to reliably obtain the $(\rho, \theta)$ of the central lines of the tubes is given. Section 5 describes the extensive experiments carried out, and finally future work and conclusions are presented in Section 6.

The Detection Problem
The chosen approach is a typical case of a template matching problem, in which a 2-D signal is compared with a model to detect any "similarity". The comparison function is generally a correlation filter. In this section, two issues are considered: first, the selection of a suitable filter, and second, the shift from a 2-D (and hence expensive) analysis to an almost equivalent 1-D analysis of the scene.

The Filter
In [6], the use of a normalized correlation filter for alignment, gauging and inspection is studied. This filter, which measures the "similarity" between a portion of a scene and a reference pattern or template, has been used for many years by the research community. Examples include stereopsis [7], where the goal is to determine the position of a particular feature in both images of a stereo pair, and the guidance system of an autonomous robot vehicle [8], where the goal is to track the position of a feature through a time sequence of images to infer vehicle motion. It has not, however, been widely used in practical machine vision applications because of the high computational cost when a 2-D analysis is involved. As an example, computing the normalized correlation function for a 500 x 500 pixel image and a 100 x 100 pixel model requires 8 billion operations. The cost is further increased by the fact that the correlation function is not isotropic, so the template must be rotated through all possible angles when a 2-D analysis is involved. This problem can be relieved using specialized architectures (e.g. Cognex 2000 [6]) which, for moderate model sizes (e.g. 32 x 32) and image sizes (e.g. 500 x 500), compute the normalized correlation function in real time. The computational cost is also drastically reduced if, as in this system, the image can be reduced to a set of one-dimensional samples and the normalized correlation is computed over these slices.
484 / SPIE Vol. 1708 Applications of Artificial Intelligence X: Machine Vision and Robotics (1992)
In what follows, a model is considered to be the values of a set of pixels $M_i$, each at a specific relative offset $(x_i, y_i)$, where $i \in 1..N$ and $N$ is the dimension of the model. Note that the model support need not be rectangular or even a connected region. The correlation coefficient $r$ of a model and the corresponding portion of an image at an offset $(u, v)$ is given by

$$r(u, v) = \frac{N \sum_i I_i M_i - \left(\sum_i I_i\right)\left(\sum_i M_i\right)}{\sqrt{\left[N \sum_i I_i^2 - \left(\sum_i I_i\right)^2\right]\left[N \sum_i M_i^2 - \left(\sum_i M_i\right)^2\right]}} \qquad (1)$$

where $I_i$ denotes the image pixel at $(u + x_i, v + y_i)$. It can be shown that the values of $r$ are always in the range -1 to 1. A value of 1 represents a perfect match between the portion of the image under consideration and the template, while a value of -1 indicates a perfect mismatch. Although one can use the values of $r$ directly as a similarity measure, there are two reasons which make a value related to $r^2$ preferable. First, it is much faster to square the numerator of the above equation than to take the square root of its denominator. Second, squaring $r$ expands its range at the high end, where the values are interesting, and compresses its range at the low end, where they are not relevant. From now on, the normalized correlation function value will be called the filter response, and is defined as $\max(r, 0)^2$: the negative values of $r$ are set to zero, and the positive values are squared.
It can easily be shown that if all of the image pixels $I_i$ are replaced with new values $I_i'$ such that $I_i' = aI_i + b$ for any $a > 0$ and $b$, the filter response is unchanged. The same is true for the model pixels $M_i$. One can consider $a$ and $b$ as gain and offset parameters. The importance of this property is that the overall gain and offset of an image depend on many random parameters, which generally are very hard to control and account for. These include factors such as illumination intensity, scene reflectivity, and the gain and offset of the sensor and digitizer. According to [6], these properties make the normalized correlation search one of the best methods overall for locating image features in practical applications of machine vision.
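The filter response and its invariance to gain and offset can be illustrated with a minimal sketch. It assumes the standard normalized correlation coefficient applied to a 1-D window; the function name `filter_response` and all sample values are hypothetical and chosen only for illustration.

```python
def filter_response(window, model):
    """Return max(r, 0)**2 for one image window and a model of equal length.

    The numerator is squared and divided by the un-rooted denominator,
    so no square root is ever taken.
    """
    n = len(model)
    s_i = sum(window)
    s_ii = sum(v * v for v in window)
    s_m = sum(model)
    s_mm = sum(m * m for m in model)
    s_im = sum(v * m for v, m in zip(window, model))
    num = n * s_im - s_i * s_m
    den = (n * s_ii - s_i ** 2) * (n * s_mm - s_m ** 2)
    if num <= 0 or den <= 0:
        # Negative r is clipped to zero; a flat window or model has no defined r.
        return 0.0
    return (num * num) / den

# Hypothetical triangular cross section used as the model.
model = [0, 1, 2, 3, 2, 1, 0]

perfect = [2 * m + 10 for m in model]      # the model with gain 2 and offset 10
inverted = [16, 14, 12, 10, 12, 14, 16]    # bright and dark swapped
noisy = [10, 13, 14, 17, 13, 12, 9]        # imperfect match
rescaled = [3 * v + 7 for v in noisy]      # same window, different gain and offset

r_perfect = filter_response(perfect, model)
r_inverted = filter_response(inverted, model)
r_noisy = filter_response(noisy, model)
r_rescaled = filter_response(rescaled, model)
```

Here `r_perfect` equals 1, `r_inverted` is clipped to 0, and `r_noisy` and `r_rescaled` coincide, which is the gain and offset invariance described above.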

2-D vs. 1-D Analysis
The most straightforward way of dealing with this detection problem is to design a 2-D model of the tube and displace it vertically across the image. At each position, the model would have to be rotated in convenient angular steps to cover the range of all possible orientations of the tubes. Considering the size of the model (typically about 30 x 250 pixels), this possibility is not practical, since the efficiency of the algorithm would not be acceptable for this application. To overcome this, the sampling of the image in thin bands, followed by 2-D template matching inside each band, is proposed. This has two advantages. First, the size of the model is drastically reduced, hence the algorithm is much more efficient. Second, for thin bands (a few pixels wide) there is no need to rotate the model, since the gray value distribution of a tube inside a thin band changes very little when the tube is rotated moderately.
In the following discussion, a rectangular model of width col2 - col1 + 1 and length d pixels is compared with a region of the image confined between columns col1 and col2. This portion of the image has d rows, where d is the diameter of the tubes.
Referring to formula (1), an expression for the number of operations involved is presented. Consider a ROW x COL image and a model of d x width pixels. There are (ROW - d + 1) positions of the model within the image, since in this case the model only moves vertically in the image. In formula (1), at each position one has to consider d x width pixels. Hence, the total number of pixel evaluations is d x width x (ROW - d + 1). This analysis uses the fact that the quantities involving only the model are calculated offline. Therefore, the efficiency of the algorithm is determined by the following three computations: $\sum I$, $\sum I^2$ and $\sum IM$. Thus 3 adds and 2 multiplies are required per pixel. Supposing that adds and multiplies have the same computational cost, the number of operations involved is approximately 5d x width x ROW.
As a consequence of the uniform overhead illumination, the 2-D model of a tube has a constant triangular cross section. Taking advantage of this fact, it is shown in [9] that formula (1) can be reduced to an equivalent formula (2), where $M_i$, $i \in 1..d$, represents the constant cross section of the model, width = col2 - col1 + 1 is the width of the model, and $IP_i = \sum_{j=col1}^{col2} I_{ij}$ are the horizontal projections of the image pixels in each band confined between columns col1 and col2. This gives a number of operations equal to width x ROW + 2d x width x ROW + 3d x ROW. For typical model sizes, the number of operations in formula (2) is about half the number of calculations involved in formula (1).
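A sketch of how such a reduction can arise is given here. It assumes the standard normalized correlation coefficient and a model whose value depends only on the row, $M_{ij} = M_i$. The notation $I_{ij}$ and $N = d \cdot width$ is introduced for illustration, and the result is not claimed to match the exact form printed in [9].

```latex
% Assumed starting point: the standard normalized correlation coefficient
% over the N = d * width pixels of the band window.
r = \frac{N \sum_{i,j} I_{ij} M_{ij} - \left(\sum_{i,j} I_{ij}\right)\left(\sum_{i,j} M_{ij}\right)}
         {\sqrt{\left[N \sum_{i,j} I_{ij}^2 - \left(\sum_{i,j} I_{ij}\right)^2\right]
                \left[N \sum_{i,j} M_{ij}^2 - \left(\sum_{i,j} M_{ij}\right)^2\right]}}

% With M_{ij} = M_i and IP_i = \sum_j I_{ij}, the sums collapse row by row:
\sum_{i,j} I_{ij} M_{ij} = \sum_{i=1}^{d} M_i \, IP_i, \qquad
\sum_{i,j} I_{ij} = \sum_{i=1}^{d} IP_i, \qquad
\sum_{i,j} M_{ij} = \mathit{width} \sum_{i=1}^{d} M_i, \qquad
\sum_{i,j} M_{ij}^2 = \mathit{width} \sum_{i=1}^{d} M_i^2

% Substituting and cancelling the common factor width:
r = \frac{d \sum_{i} M_i \, IP_i - \left(\sum_{i} IP_i\right)\left(\sum_{i} M_i\right)}
         {\sqrt{\left[d \cdot \mathit{width} \sum_{i,j} I_{ij}^2 - \left(\sum_{i} IP_i\right)^2\right]
                \left[d \sum_{i} M_i^2 - \left(\sum_{i} M_i\right)^2\right]}}
```

Under this reading, only $\sum I_{ij}^2$ still visits every pixel of the window (one multiply and one add), the projections cost one add per pixel once per row, and the remaining sums run over $d$ values only. This is one plausible way to account for the three terms of the operation count quoted above.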
As the efficiency was not yet satisfactory, a completely 1-D analysis was devised. This is achieved by considering the average of the pixel values on each row of a band as the corresponding pixel value of the associated slice. Subsequently, the 1-D version of formula (1) is used to do template matching over each slice, taking as the model the constant triangular cross section of gray levels of the tubes. It can be shown that doing so reduces the computations involved by a factor of 3 with respect to formula (2).
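The three operation counts can be compared numerically with a short sketch. The counts for formulas (1) and (2) are the expressions quoted in the text. The count for the 1-D analysis is an assumption made here (one add per pixel to average each band row, then 5 operations per model sample), and the sizes d = 30, width = 5 and ROW = 500 are illustrative values rather than measured ones.

```python
def ops_formula_1(d, width, rows):
    # 3 adds and 2 multiplies for each of the d * width model pixels,
    # at roughly `rows` vertical positions of the model.
    return 5 * d * width * rows

def ops_formula_2(d, width, rows):
    # Count quoted in the text for the projection-based formula (2).
    return width * rows + 2 * d * width * rows + 3 * d * rows

def ops_1d(d, width, rows):
    # Assumed count for the 1-D analysis: build the slice by row
    # averaging, then apply the 1-D version of formula (1).
    return width * rows + 5 * d * rows

d, width, rows = 30, 5, 500   # hypothetical sizes

full_2d = ops_formula_1(d, width, rows)
projected = ops_formula_2(d, width, rows)
one_d = ops_1d(d, width, rows)

ratio_1_to_2 = full_2d / projected    # close to 2 for these sizes
ratio_2_to_1d = projected / one_d     # between 2 and 3 for these sizes
```

For these sizes the ratios come out slightly below the factors of 2 and 3 stated in the text, which is consistent with those factors being approximate.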

Sampling and Filtering
Three problems are addressed in this section. First, the question concerning the best locations of the bands in the image is considered. Second, the criteria for the detection of tubes in each band are discussed. Third, a procedure for a local estimate of the orientation of a tube is given. The importance of this local estimation in terms of robustness and efficiency is shown in Section 4. Fig. 3 depicts the Peak selection and angle estimation stages.

Sampling
The following factors should be considered in the selection of the band width. The larger the width of a band, the less efficient the algorithm becomes. Also, wider bands are more sensitive to rotation of the tubes. Their only advantage is the increased immunity to noise: in narrow bands less signal is present, so jitter noise and other undesirable effects can dominate the gray-level response of a tube. In this system, 3 to 5 pixels is considered the best choice for the band width. The positions of the bands are chosen by the following criteria.
(1) They should be located in predefined zones.
(3) The filter response to the background sampled by these bands should be minimum with respect to all other possible samplings which fulfill conditions (1) and (2).
Conditions (1)-(3) imply that only well-spaced, least noisy bands are selected to sample the image. In this system, samples are selected mostly in the portions of the image where the floor is not visible (i.e. on the rotating wheels of the conveyor). Hence, the noise coming from the factory floor is almost eliminated.
In spite of the careful selection of the samples, the filter response to the background at some points of the chosen slices can still be quite high (the highest filter response to noise can be higher than the acceptance threshold). This implies that the filter response to the image alone is not enough to determine the presence of a signal (i.e. a real tube). It will not be clear whether a given weak response corresponds to a dirty tube or to some noise in the background. The following subsection describes how to overcome this difficulty.

Filtering
In order to perform template matching, the model size must be known in advance. This is dictated in this case by the diameter of the tubes which is readily available to the operator.
A Peak is defined to be a 3-tuple with the following components: a) the coordinates of a local maximum of the filter response to an image slice with a value above a given minimum. This threshold is called the acceptance threshold and is the lowest value which the filter response of a tube can take. The X-coordinate of the Peak is determined by the slice position, and the Y-coordinate by the position of the local maximum in the slice; b) a reliability factor, which is a measure of how much credit is given to each Peak; c) a local estimate of the orientation of the tube in the neighborhood of the Peak. The height of a Peak is defined as the filter response at that point of the slice.
The filter is run over the chosen slices in the image and also at the same positions in the background. The filter responses of the image slices are then analyzed and the Peaks detected, fig. 3 (a). The reliability factor for each candidate Peak is a real function of two variables: the height of the Peak and the filter response to the background at the same location. In this system, this function returns the difference between the two values. From the set of original candidate Peaks, only the ones with a reliability factor greater than a predefined threshold are considered for further processing.
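The Peak selection stage can be sketched as follows. This is a minimal illustration, not the system's implementation: the threshold values, the function names and the convention of reporting the Y-coordinate at the center of the matched window are all assumptions, and the slices are small synthetic examples.

```python
def filter_response(window, model):
    """max(r, 0)**2 of the normalized correlation between window and model."""
    n = len(model)
    s_i, s_m = sum(window), sum(model)
    s_ii = sum(v * v for v in window)
    s_mm = sum(m * m for m in model)
    s_im = sum(v * m for v, m in zip(window, model))
    num = n * s_im - s_i * s_m
    den = (n * s_ii - s_i ** 2) * (n * s_mm - s_m ** 2)
    return (num * num) / den if num > 0 and den > 0 else 0.0

def response_profile(slice_values, model):
    """Filter response at every vertical position of the model in a slice."""
    d = len(model)
    return [filter_response(slice_values[y:y + d], model)
            for y in range(len(slice_values) - d + 1)]

def detect_peaks(image_slice, background_slice, model, x,
                 acceptance=0.5, min_reliability=0.3):
    """Return (x, y, reliability) for each reliable local maximum.

    acceptance and min_reliability are hypothetical threshold values.
    """
    img = response_profile(image_slice, model)
    bkg = response_profile(background_slice, model)
    peaks = []
    for y in range(1, len(img) - 1):
        is_local_max = img[y] >= img[y - 1] and img[y] > img[y + 1]
        if is_local_max and img[y] >= acceptance:
            # Reliability: Peak height minus background response at the same place.
            reliability = img[y] - bkg[y]
            if reliability > min_reliability:
                peaks.append((x, y + len(model) // 2, reliability))
    return peaks

model = [0, 1, 2, 3, 2, 1, 0]                 # triangular cross section
background = [5] * 30
background[20:27] = [5, 6, 7, 8, 7, 6, 5]     # tube-like clutter in the background
image = list(background)
image[8:15] = [5, 7, 9, 11, 9, 7, 5]          # a real tube on top of the background

peaks = detect_peaks(image, background, model, x=40)
```

In this example both the tube and the clutter produce a strong filter response, but only the tube survives, because the clutter responds equally strongly in the background slice and its reliability is zero.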
Once the "true" Peaks are determined, what remains is the local determination of the orientation of the tubes which is the subject of the following subsection.

Angle Estimation
This estimate ($\theta_{est}$) must be very robust since the Peaks are mapped to the Hough accumulator array only for the values of the angle in the range $\theta_{est} \pm \phi$ degrees. In this system, the angle is estimated with a confidence of $\pm\phi$ degrees (i.e. $\phi = 2$).
The uniformity of the gray values along the central bright line of a tube is used as a criterion for a rough approximation of the orientation of the tube. Once close to the true angle, the average gray value of the pixels is used to refine the estimate of the angle. This is because the central line of the tube is relatively much brighter than its neighborhood.
Figure 4: Real tubes with detected Peaks superimposed. The "sticks" of the Peaks indicate the estimated orientation.
To implement the above ideas, the neighborhood of a Peak in the image is scanned with a digital line 60 pixels long whose midpoint is the position of the Peak, fig. 3 (b). This line is rotated in angular steps of 5 to 10 degrees. The scan range is determined by the possible tube orientations, in this case 10 to 170 degrees. For each position of the line, two features are extracted: the average and the variance of the gray levels of the pixels contained in the digital line. When the digital line is far from the true angle, most of the line is off the tube under consideration and the gray values are very non-uniform. For a rough estimate of the angle, the variance of the gray level distribution is therefore adequate. However, the variance alone is not a very accurate measure of the orientation, since the central line of the tube frequently has dark spots caused by dirt and other contamination which spoil the uniformity of the pixel gray values. The average gray value of the pixels of the line is useful in refining the estimate. After a rough estimate of the angle using the variance feature, the neighborhood of this candidate angle is inspected and the angle whose digital line has the maximum average gray value is chosen. Interpolations are done to estimate the angle with a precision several times higher than the rotation steps.
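The two-stage search (variance for the rough estimate, average gray value for the refinement) can be sketched as follows. Several choices here are assumptions made for illustration: the angle is taken to be the angle of the line normal, as in the $(\rho, \theta)$ parametrization; the refinement scans 1-degree steps around the coarse candidate instead of interpolating; and the test image is a synthetic tube with a triangular cross section on a flat background.

```python
import math

def line_values(image, cx, cy, theta_deg, length=60):
    """Gray values on a digital line centred at (cx, cy).

    theta_deg is the angle of the line normal, so the line direction
    is (-sin(theta), cos(theta)). Pixels outside the image are skipped.
    """
    t = math.radians(theta_deg)
    dx, dy = -math.sin(t), math.cos(t)
    values = []
    for s in range(-length // 2, length // 2 + 1):
        x, y = round(cx + s * dx), round(cy + s * dy)
        if 0 <= y < len(image) and 0 <= x < len(image[0]):
            values.append(image[y][x])
    return values

def mean_and_variance(values):
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)

def estimate_angle(image, cx, cy, coarse_step=10, lo=10, hi=170):
    # Rough estimate: the coarse angle with the most uniform gray values.
    coarse = min(range(lo, hi + 1, coarse_step),
                 key=lambda a: mean_and_variance(line_values(image, cx, cy, a))[1])
    # Refinement: the brightest line in the neighborhood of the coarse angle.
    fine = range(max(lo, coarse - coarse_step), min(hi, coarse + coarse_step) + 1)
    return max(fine,
               key=lambda a: mean_and_variance(line_values(image, cx, cy, a))[0])

def make_tube_image(size, cx, cy, theta_deg, half_width=10.0, peak=150.0, floor=50.0):
    """Synthetic tube through (cx, cy): triangular profile on a flat floor."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    rho = cx * c + cy * s
    slope = peak / half_width
    return [[floor + max(0.0, peak - slope * abs(x * c + y * s - rho))
             for x in range(size)] for y in range(size)]

image = make_tube_image(101, 50, 50, theta_deg=135)
theta_est = estimate_angle(image, 50, 50)
```

The true angle of 135 degrees is deliberately not on the coarse 10-degree grid, so the result depends on the refinement stage. A real image would also need the dark-spot tolerance discussed above, which this clean synthetic tube does not exercise.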
At this stage, all of the components of the true Peaks have been obtained. Fig. 4 shows a real image of the tubes with the detected Peaks (black points) superimposed. The "stick" of each Peak indicates the local estimate of the orientation of the tube. Subsequently, the Peaks are mapped to the Hough space.

Hough Mapping
The Hough transform [10,11,12] provides a technique for obtaining the parameters of a model given a set of points that include instances of the model. Common uses include determination of the parameters of a straight line or a circle, but it may be extended to arbitrary shapes [12]. In the following, the use of the Hough transform in this system is described.
In principle, a decision could be made about the number of tubes by simply considering the number and the position of the Peaks in each slice. For example, the Peaks in each slice could be counted, and the median over all the slices could be taken to be the number of tubes. The reasons which make the Hough mapping an attractive method are the following. First, it is easy to determine the positional anomalies of the tubes when the $(\rho, \theta)$ of the tubes are given, while without this information the task is very complex. Second, it may happen that in each slice a few true Peaks get lost due to adverse imaging conditions, causing one or more tubes to be missed. For this to happen in the Hough method, the Peaks corresponding to exactly the same tube, not just any Peaks, must be missing in the majority of the slices. This can occur only when a tube is literally not visible throughout its length, a condition with extremely low probability of occurrence.
Figure 5: A false tube detection when no tube orientation estimate is available (labels: a tube, a detected Peak, collinear Peaks forming a false tube).
It is shown below that, in order for the Hough-based method to be effective, a priori knowledge about the orientation of the tube to which a given Peak belongs is essential. Suppose that a conventional method of mapping a set of points to the Hough space were used. For each Peak, a wide range of possible angles (in this case 10 to 170 degrees) would have to be scanned and the corresponding $\rho$ for each angle calculated. The most obvious disadvantage is the computational cost since, aside from the Hough array accumulation, all subsequent operations such as smoothing and connected component analysis of the Hough accumulator array would have to be done on a matrix whose $\theta$ dimension is very wide. In this system, by contrast, the range of possible values of $\theta$ is limited to the interval between the lowest and highest estimated angles. This fact, coupled with the minimum and maximum values of the calculated $\rho$, defines a region of interest in the Hough array for the subsequent processing. Consequently, the number of operations and the storage are substantially reduced without the need for any particularly efficient implementation of the Hough array [13]. Aside from efficiency concerns, a very serious problem arises with traditional Hough mapping when multiple lines are to be detected. Fig. 5 shows this problem: collinear Peaks can give rise to false tubes. This problem is overcome by using the estimated angle, since each Peak is mapped only for values of $\theta$ in a neighborhood of its estimate. Given the Peaks, three steps are involved in extracting the parameters of the lines from them, namely Hough array accumulation, smoothing, and connected component analysis.

Hough Array Accumulation
It was observed that for this particular problem, quantizing the Hough space with $\Delta\theta = 1$ degree and $\Delta\rho = 1$ pixel gives the best results. The accumulator cells are incremented by an amount proportional to the reliability factor associated with each Peak. In this way, the Peaks with more credit have a higher participation in determining the parameters of the lines.
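A minimal sketch of the restricted, reliability-weighted accumulation follows. It assumes the usual normal parametrization $\rho = x \cos\theta + y \sin\theta$ and stores the accumulator as a sparse dictionary for brevity, whereas the system described here works on a region of interest of an array. The function name, the Peak tuple layout and the sample Peaks are hypothetical.

```python
import math

def accumulate(peaks, phi=2, d_theta=1, d_rho=1):
    """Vote each Peak into the cells with theta in [theta_est - phi, theta_est + phi].

    peaks: iterable of (x, y, reliability, theta_est_deg), theta_est_deg an int.
    Returns a dict mapping (theta_deg, rho_bin) to the accumulated weight.
    """
    acc = {}
    for x, y, reliability, theta_est in peaks:
        for theta in range(theta_est - phi, theta_est + phi + 1, d_theta):
            t = math.radians(theta)
            rho = x * math.cos(t) + y * math.sin(t)
            cell = (theta, round(rho / d_rho))
            # The vote is weighted by the reliability factor of the Peak.
            acc[cell] = acc.get(cell, 0.0) + reliability
    return acc

# Four hypothetical Peaks on the horizontal line y = 100 (theta = 90, rho = 100),
# found in slices at x = 50, 150, 250 and 350.
peaks = [(50, 100, 0.9, 90), (150, 100, 0.8, 90),
         (250, 100, 1.0, 90), (350, 100, 0.7, 90)]

acc = accumulate(peaks)
best_cell = max(acc, key=acc.get)
```

Each Peak votes in only 2φ + 1 = 5 columns of the accumulator instead of the 161 columns of the full 10 to 170 degree range, and `best_cell` collects the summed reliability of all four Peaks.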

Smoothing
Mainly due to noise and quantization in the $\theta$ and $\rho$ space, the local maximum corresponding to the parameters of a true line is dispersed [14,15]. A rough smoothing is done to make sure that very close local maxima are unified. This is done by a standard filtering operation in Hough space which assigns to a pixel the maximum value of the pixels in a neighborhood. The size of the window is a compromise between the expected minimum distance between the "hills" in the Hough accumulator array and the expected degree of dispersion of a hill. It was found that a window of width 5 performs best in all images. As no sub-bucket accuracy is needed, sophisticated smoothing and interpolation procedures [15] are not necessary.
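The smoothing step amounts to a maximum filter over the accumulator. The sketch below assumes a square 5 x 5 window (the text only states a width of 5) and clips the window at the array borders; the small accumulator is a made-up example.

```python
def max_filter(acc, width=5):
    """Replace every cell by the maximum over a width x width neighborhood."""
    half = width // 2
    rows, cols = len(acc), len(acc[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = max(acc[rr][cc]
                            for rr in range(max(0, r - half), min(rows, r + half + 1))
                            for cc in range(max(0, c - half), min(cols, c + half + 1)))
    return out

# Hypothetical 9 x 9 accumulator: one hill dispersed over two adjacent cells.
acc = [[0.0] * 9 for _ in range(9)]
acc[4][4] = 3.0
acc[4][5] = 2.0

smoothed = max_filter(acc)
```

After filtering, the two neighboring maxima are merged into a single plateau of height 3 around cell (4, 4), while cells further than two steps from any vote remain zero.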

Connected Component Analysis
The problem at hand is to reliably segment the hills in the accumulator array. This is done by first thresholding the accumulator array with a global threshold; for some other applications, more sophisticated thresholding techniques may be more convenient [16,17]. The thresholding procedure assigns a zero whenever the value of an accumulator cell is below the given threshold, but maintains its value otherwise. This ensures that the shape of the hills is preserved, hence a more accurate estimation of the centroid is achievable.
Once thresholding is done, the labeling algorithm described in [17] is run over the array, and subsequently the centroids of the regions are calculated. These points determine the $(\rho, \theta)$ of the central lines of the tubes. Fig. 6 shows the detected lines superimposed on a real image of the tubes. To report any anomalies, the equations of the lines are solved to find all the intersections. Too slanted tubes, fig. 2 (c), are detected by considering the $\theta$ of the lines.
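Thresholding, labeling and centroid extraction can be sketched together. The labeling algorithm of [17] is not reproduced here; a plain flood fill with 8-connectivity is used in its place, and the centroid is weighted by the preserved cell values. The function name and the sample accumulator are hypothetical.

```python
def hill_centroids(acc, threshold):
    """Return the value-weighted centroid (row, col) of every hill."""
    rows, cols = len(acc), len(acc[0])
    # Cells below the threshold become zero; the others keep their value.
    kept = [[v if v >= threshold else 0.0 for v in row] for row in acc]
    seen = [[False] * cols for _ in range(rows)]
    centroids = []
    for r0 in range(rows):
        for c0 in range(cols):
            if kept[r0][c0] == 0 or seen[r0][c0]:
                continue
            # Flood fill one connected region (8-connectivity).
            stack = [(r0, c0)]
            seen[r0][c0] = True
            mass = sum_r = sum_c = 0.0
            while stack:
                r, c = stack.pop()
                w = kept[r][c]
                mass += w
                sum_r += w * r
                sum_c += w * c
                for rr in range(max(0, r - 1), min(rows, r + 2)):
                    for cc in range(max(0, c - 1), min(cols, c + 2)):
                        if kept[rr][cc] != 0 and not seen[rr][cc]:
                            seen[rr][cc] = True
                            stack.append((rr, cc))
            centroids.append((sum_r / mass, sum_c / mass))
    return centroids

# Hypothetical accumulator: two hills and one weak false vote.
acc = [[0.0] * 8 for _ in range(6)]
acc[1][1] = 4.0
acc[1][2] = 4.0   # first hill spans two cells
acc[4][6] = 2.0   # second hill
acc[3][3] = 1.0   # below the threshold, discarded

centroids = hill_centroids(acc, threshold=2.0)
```

Because the cell values are preserved by the thresholding, the centroid of the first hill falls between its two cells at column 1.5 rather than on either of them. Which axis holds θ and which holds ρ is a matter of convention in this sketch.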

Simulation and Experimental Results
Experiments were performed using simulated images and an extensive number of real images to test the accuracy of this method (for definitions of accuracy and precision see [18,19]). The errors in the estimation of the orientation and position of the tubes were taken to be the absolute values of the differences between the true and estimated orientations and positions of the tubes. In the case of the simulated images, the exact value of the true $(\rho, \theta)$ of an ideal tube is known in advance. In the case of the real images, however, the $(\rho, \theta)$ of a real tube is estimated manually with an expected precision of about 1 degree in orientation and 1 pixel in position.
A measure of the signal to noise ratio (S/N) is also given. This is taken to be the ratio between the height of the lowest true hill and the highest false hill in the smoothed Hough array. False hills can arise as a result of the detection of false Peaks. These Peaks are also mapped to the Hough array and, depending on their reliability value, may or may not participate actively in the accumulation of the array. Too many collinear false Peaks can cause a high false hill to arise in the Hough array. The S/N as above is considered to be a good measure of the

Simulations
In the simulations, 500 images, each containing 12 tubes on average, were created using a real background image and generated "ideal" tubes. Ideal tubes are constructed such that their cross section is uniform and identical to the model. Fig. 2 shows these tubes on a real background. The simulated images are mainly used to test the algorithm in the presence of positional anomalies (e.g. superpositions) which are difficult to find in practice. The simulated tubes have varying diameters, positions and orientations.
The result of running the algorithm over the simulated images is the following. Accuracy in the measurement of the $\rho$ and $\theta$ of the tubes was better than 1 pixel and 1 degree respectively. The detection reliability was 100%, with no false alarms.

Experiments on Real Images
A set of over 70 real images, each containing approximately 12 tubes, was used to test the algorithm. The images were deliberately taken in the most adverse conditions: saturated images, out-of-focus sensing, presence of background and random noise, strong illumination variations (e.g. presence of sunlight as stray light or, at the other extreme, total absence of ceiling light), and real tube superpositions. The result of the application of the algorithm to these real images is the following. Accuracy in the measurement of the $\rho$ and $\theta$ of the tubes was better than 2 pixels and 3 degrees respectively. The detection reliability was 100%. However, because of the presence of a few faint false hills in the Hough accumulator array, the S/N was observed to be 30 dB in the worst case. There were no false alarms. Fig. 7 shows a portion of the smoothed Hough array for the worst real image. The false peaks do not have sufficient relative height to be observed in the figure.

Conclusions and Future Work
A robust tube detection prototype system has been presented. In this system, the a priori information available through the shape of the tubes was successfully exploited, leading to a reduction from 2-D to 1-D analysis. Comparing the number of operations involved in a fully 2-D analysis with the total number of operations in this system, it can be shown that a speedup of four orders of magnitude was achieved without increasing the error rate. As a detector, the normalized correlation filter yielded consistent results. The local information supplied by the Peaks was accumulated by the Hough array to yield a robust estimate of the parameters of the tubes.
The problem of detecting perfectly mounted tubes, fig. 2 (a), has not yet been solved. In order to detect this anomaly, a preprocessor module is needed which uses a model capable of detecting this particular positional configuration of the tubes.
A possible extension to this system is the incorporation of the algorithms to measure the length of the given tubes in the same location where they are counted. As the typical lengths of the tubes are 8 to 12 meters approximately, the capture of the totality of a tube is a problem. Several cameras and/or special arrangements may be required.
In order for this prototype to be considered a fully working system, the algorithms must be tested with a very large number of real images (i.e. over 1000 images, each containing 12 tubes on average).