Statistical Segmentation of Geophysical Log Data

Stationary segments in well log sequences can be automatically detected by searching for change points in the data. These change points, which correspond to abrupt changes in the statistical nature of the underlying process, can be identiﬁed by analysing the probability density functions of two adjacent sub-samples as they move along the data sequence. A statistical test is used to set a signiﬁcance level of the probability that the two distributions are the same, thus providing a means to decide how many segments comprise the data by keeping those change points that yield low probabilities. Data from the Ocean Drilling Program were analysed, where a high correlation between the available core-log lithology interpretation and the statistical segmentation was observed. Results show that the proposed algorithm can be used as an auxiliary tool in the analysis and interpretation of geophysical log data for the identiﬁcation of lithology units and sequences.


Introduction
Segmentation is an important data mining process.One important application is the identification of locally stationary intervals or, equivalently, the location of change points.In this context, segmentation (also known as zonation) is the dividing of a sequence into relatively homogeneous and stationary intervals such that each segment is distinctive from the adjacent ones.Well logs can be subdivided into relatively uniform segments that represent zones of similar lithologic character (stratigraphic units and formations).Segment boundaries correspond to abrupt changes in the layering and conform the limits of relatively stable periods or geologically meaningful zones.These elementary units of similar properties can then be used as the basis for inferring correlations between wells.A different approach consists of blocking or filtering the data to get a simpler approximation (e.g.piecewise constant segments).This segmentation problem will not be considered here, and the reader is referred to, for example, Kaaresen and Taxt (1998) and the references therein for details.In this work, the focus is on the identification of statistically distinct intervals in the log sequences.
There are various strategies for addressing this segmentation problem.Classical approaches include the detection of abrupt changes in the mean (Webster 1973) or in the variance (Gill 1970;Hawkins and Merriam 1973).A general description of these techniques is given in Davis (1986).Recent studies include zonation by means of cluster analysis (Gill et al. 1993), spectral analysis for identifying stationary intervals (Ligges et al. 2002), etc.The method presented here takes into account both the mean and the variance, and also higher-order robust statistics such as certain non conventional skewness and kurtosis measures (Velis 2003) to identify change points.Essentially, a split window is moved along the sequence and the probability density functions (pdf) of the two adjacent half-windows are compared.When a significant difference is detected, a change point is identified.Smooth pdfs are estimated using the maximum entropy method as described in Velis (2003), which guarantees robustness when dealing with short data sequences.Finally, a criterion for deciding which is the number of segments that comprise the data is proposed.
The effectiveness of this strategy is supported by the analysis of various examples using simulated and real data sequences derived from well-logs.The real data comprises various borehole measurements which are part of the Ocean Drilling Program, Leg 197, Site 1203(Tarduno et al. 2002).At this site, the lithology interpretation based on extensive core samples and logging data analysis was available, so it was possible to make a comparison between this interpretation and the statistical segmentation.Results show that there is a high correlation between the published core-log lithology and the segmentation generated by the proposed statistical procedure.

Let E
x = (x 1 , x 2 , . . ., x N ) be the sequence of well log data.The objective of the segmentation process is to subdivide this series into smaller segments so that each interval is relatively locally stationary.That is, we look for a sequence of change points which satisfy These indexes determine a set of M − 1 segments of length (3) In practice the algorithm proceeds iteratively by searching successive change points {t j } based on the assumption that two adjacent intervals are distinct when the pdfs of the data on each side of t j are significantly different.For this purpose, a split window of length 2L is centered at location t j , and the corresponding pdfs are estimated and compared appropriately.
Here, L should be short enough to allow for the identification of short stationary intervals.Thus, a robust pdf estimation method that works well even for short data sequences is required.The maximum entropy (MaxEnt) method with moment constraints described in Velis (2003) produces smooth non-parametric pdfs which are consistent with the data.The approach utilizes robust statistics computed directly from the data to constrain the maximization of the pdf entropy.These statistics (called S-measures) involve the non-conventional skewness and kurtosis indices that measure shape and proved to be appropriate to identify the main features of the distribution of primary reflection coefficients (reflectivity).In exploration seismology, the reflection coefficient is the ratio of the amplitude of the displacement of the reflected wave to that of the incident wave.The reflectivity is one of the components, together with the seismic wavelet, of the so-called convolutional model of the seismic trace, especially valid for layered geological models (Yilmaz 2001).
The strategy to carry out the segmentation is based on the sliding window approach, which consists of moving the analysing window along the whole sequence and assigning a change point when a significant difference between the pdfs is observed.To avoid the assigning of change points which are too close, we found it more appropriate to look for a single change point at a time.Starting with j = 2 (recall that t 1 = 1), we look for optimum change points until the next change point that is added does not yield a significant difference between the adjacent pdfs.These optimum change points correspond to the smallest probabilities along the whole sequence for the current iteration.

The Algorithm
Let t be the current estimate of the j th change point.Let E u = (x t−L , x t−L+1 , . . ., x t ) and E v = (x t , x t+1 , . . ., x t+L ) be the two subsets of E x spanned by the split window, and let pu (E u) and pv (E v) be the corresponding estimated pdfs, which are to be compared.Rather than measuring the difference between pu and pv , we measure the difference between their respective cumulative distribution functions (cdf), Pu and Pv , using the Kuiper test (the cdfs are calculated numerically by integrating the pdf estimates).The Kuiper test, a variant of the well-known Kolmogorov-Smirnov test (Press et al. 1992), quantifies the difference between two cdfs.The Kuiper statistic is given by where a and b define the region of support of the cdf (usually the minimum and maximum values in the data set).It turns out that the distribution in the case of the null hypothesis that the two data segments come from the same distribution can be calculated asymptotically, giving rise to a formula that allows one to compute the significance level (Press et al. 1992) where The segmentation algorithm is a three stage process.In the first stage, the probability (5) is calculated for every possible change point location throughout the whole sequence in the range (L, N − L).In the second stage, change points candidates are added according to the following strategy: at the beginning, the point with the smallest probability is selected as a candidate for the first change point, yielding t 2 and the new segmentation (t 1 , t 2 , t 3 ), which is comprised of two segments of lengths T 1 and T 2 , respectively.Then, a new change point is added by selecting the smallest probability within the current longest segment (largest T j ), giving rise to a new partition (t 1 , t 2 , t 3 , t 4 ).This process is repeated and new change points are added (within the longest segments obtained so far) until all segments are shorter than a given minimum length, T min .
The third stage of the algorithm consists of discarding those change points whose associated probabilities are larger than a predefined threshold.Also, the change points with largest probabilities in excess of a predefined number of change points are deleted.Note that a large probability is indicative of a high degree of confidence on the null hypothesis that the two distributions are the same, so low values of probability are desired to obtain a high confidence on the hypothesis that the two distributions are different.To avoid too fine segmentations (i.e. two change points separated by a few samples), a minimum separation Δ between two consecutive change points is forced by adjusting the search range accordingly.
Step by step, the algorithm is as follows.
1. Set j = 1. 2. For every t in the initial search range (L, N − L): Estimate pu and pv using the MaxEnt method.(c) Estimate Pu and Pv by numerical integration.(d) Compute V and evaluate the probability (5). 3. Set j = j + 1. 4. Find the smallest probability within the current search range to get a new optimum change point t j . 5. Sort, in ascending order, the current set of change points and update the segmentation (t 1 , t 2 , . . ., t j , t M ). 6. Compute segment lengths according to (3). 7. Update the search range: (t j max + Δ, t j max+1 − Δ), where t j max is the beginning of the longest segment, T max .8. If T max > T min , go to Step 3. 9. Delete all change points whose probabilities are larger than a predefined significance level.10.Delete all change points in excess of a predefined maximum number of change points whose probabilities are largest.

Test Results
To check the consistency of the segmentation algorithm, we applied it to the simulated sequence (8400 random values) shown in Fig. 1.The sequence was generated by concatenating samples drawn from eight different non-parametric distributions.These distributions were selected so as to simulate a realistic reflectivity sequence (Velis 2003) and are shown in Fig. 2.
In the segmentation process we set L = 250 and Δ = 200, and change points were added until no segment was larger than T min = 200 samples.At the end of the process, the change points with the associated probability larger than 0.01 were discarded.This significance level was chosen based on the inspection of Fig. 3, where the probability (5) was plotted in ascending order for all the identified change points.For values larger than about 0.01, the probability of the null hypothesis that the two distributions are the same increases rapidly.The estimated change points are shown in Fig. 1 and in Table 1, along with the correct change points.All eight segments were identified correctly.1 shows the exact location of the change points Fig. 2 Probability density functions used to generate the reflectivity sequence shown in Fig. 1 Fig. 3 Probability of the null hypothesis that the two distributions are the same.The plot reveals an abrupt change at about 0.01, a value which is selected as a threshold to discard change points with high probabilities in the third stage of the segmentation process Table 1 The eight segments used to build the sequence shown in Fig. 1  The next example shows the results of the segmentation process when applied to various geophysical logging data sequences.The data, which are part of the Ocean Drilling Program (Leg 197, Site 1203), were collected to characterize the southward motion of the Hawaiian Hotspot in the Emperor Seamount trend (Tarduno et al. 2002).The drilling achieved moderate basement penetration and high recovery allowing for a detailed lithostratigraphy analysis.Downhole measurements, which are of very good quality in the basement sections, included various standard and nonstandard tool strings and passes.
Figure 4 displays all data sequences used in the segmentation procedure.In particular, we selected total natural gamma ray, electrical resistivity, bulk density, porosity and S-wave velocity.The sampling interval is 0.1524 m, and each data sequence contains about 3300 samples in the interval considered.The last column of the figure, labeled "mean standardized log data", was built so as to take into account all logging data into a single sequence.For this purpose, the five previous sequences were standardized and averaged with equal weight and appropriate sign into a single sequence (the logarithm of the electrical resistivity was used in this sum.The polarity of both the total natural gamma ray and porosity was changed before the sum).The resulting "mean" sequence was the data that we actually used to carry out the statis- together with the unit number identification.Total natural gamma ray, electrical resistivity, bulk density, porosity and S-wave velocity were combined into a single sequence named "mean standardized log data" (see the text for details).The statistical segmentation algorithm detected the 32 change points indicated by horizontal lines (dashed lines correspond to change points that do not correlate with the available log lithology column).The same color code was used in all cases in order to facilitate the visual comparison between the available core-log interpretation and the results of the statistical segmentation tical segmentation.Note that this sequence exhibits features of the five previous data sequences, allowing for a full multivariate segmentation of the whole data set.
The results of the segmentation are shown in the same figure along with the available core-log lithology interpretation.The available interpretation in the analysed interval (400-900 meters below sea-floor, mbsf) comprises 30 units of alternating sediments (ooze and volcaniclastic) and basalts.The basement starts at about 460 mbsf.For details, see Tarduno et al. (2002).As for the statistical segmentation process, we set L = 31, Δ = 31 and T min = 50 (sample units) and we kept the 32 change points with the lowest probabilities.In general, the correlation between the statistical segmentation and the core-log interpretation is from good to excellent.All the major units were correctly identified.In particular, all the units identified in the available log interpretation were automatically detected by the statistical segmentation algorithm, except for the thin layers at about 475 mbsf (unit 2), 490 mbsf and 850 mbsf.In any case, these thin beds are not so clear by inspecting the considered log data.Actually, except for unit 2, the other two beds were not identified in the core lithology analysis.On the other hand, there is a total of 12 change points (denoted by dashed lines in the figure) which are not identified in the available log lithology interpretation.Some of these change points may be associated to units clearly identified in the core lithology.For example, units 27 and 28 are identified as a single unit in the available log lithology, but as two units in the statistical segmentation, in accordance to the core lithology analysis.The same can be said for units 19 and 20.Moreover, units 12, 13 and 14 are identified as two units in the available log interpretation, but correctly as three distinct units in the statistical segmentation.Finally, some of the remaining detected change points that do not correlate with the available core-log interpretation may be associated to a fine lithology layering (e.g.division of units into sub-units, etc.).In effect, the three change points detected in unit 4 by the statistical segmentation at 505, 515 and 524 mbsf, for example, correlate very well with core-lithology subunits 4i, 4k and 4m at 508, 516 and 523 mbsf, respectively.These subunits are not indicated in the core lithology column, and the interested reader can find detailed information in Tarduno, Duncan, and Scholl (2002).A deeper analysis regarding subunits is beyond the scope of this work.
Another point worth mentioning is the accuracy of the detected change points.The statistical segmentation procedure automatically detects very accurately the presence of a change point.Based on the information provided by the core interpretation, the thickness of unit 22, for example, is about 14 m.The same value is obtained after the statistical segmentation.On the contrary, the available log interpretation estimates a thickness of about 16.5 m.

Conclusions
The detection of stationary segments in geophysical log data sequences can be carried out in a quasi-unsupervised mode by searching for change points in the data.The MaxEnt method using robust non-conventional statistics that measure shape provides an appropriate technique to estimate the distributions that are to be compared.After estimating the distributions of the two halves of a moving window, abrupt changes are easily identified based on the analysis of the probability of the null hypothesis that the two distributions are the same.The Kuiper test proved to be a useful criterion to decide which change points lead to significant differences between adjacent distributions.This provides a means of choosing the appropriate number of locally stationary segments that the data sequence can be subdivided into.
The statistical segmentation algorithm presented in the work is viewed as an auxiliary tool that may contribute useful information in the identification of the main lithology units and sequences as derived from measured geophysical log data.The zonation of a borehole environment is an essential step in the correlation of subsurface layers between wells, with application in oil exploration and reservoir evaluation.

Fig. 1
Fig.1Simulated random sequence comprised of eight statistical independent segments.The segmentation is indicated by vertical lines: true (top) and estimated (bottom).Table1shows the exact location of the change points

Fig. 4
Fig. 4 Segmentation of the Ocean Drilling Program Downhole logs (Leg 197, Site 1203).The left panel shows the core and log lithology columns after Tarduno et al. (2002),together with the unit number identification.Total natural gamma ray, electrical resistivity, bulk density, porosity and S-wave velocity were combined into a single sequence named "mean standardized log data" (see the text for details).The statistical segmentation algorithm detected the 32 change points indicated by horizontal lines (dashed lines correspond to change points that do not correlate with the available log lithology column).The same color code was used in all cases in order to facilitate the visual comparison between the available core-log interpretation and the results of the statistical segmentation