Finding similarity measures between arbitary complex object pairs is by no means an easy task. Today, I'd like to quickly review the approach of Bollabas et al. on time series. Clearly, given a fixed length time series, we can encode it as point, and therefore compare two times series by using, say, the Lp norm. The problem is that time-series are prone to outliers, may have different scaling factors, etc.
The approach in http://citeseer.ist.psu.edu/bollobas98timeseries.html
is to use the longuest common subsequence matching to define and integer similarity measure. We seek for two monotonous index sequences in the time series that match up to epsilon in value. Also we allow for some drift amount of the indices in the two series. This end-up with an O(n^6) algorithm, far too costly.
So the beauty of the paper, is to map these times series into plane and study the well-separeness property.
To sets S1 and S2 are said k-separated iff they symmetric difference is of cardinality at least k. In the context of geometric line segments, one can show that the maximum size of any k-separated line segment is n^2/k^2 (and not 2^something). Geometric configurations therefore drastically reduces the size.
It turns out that with random sampling, one can build an efficient linear randomized algorithm that performs well in practice (with accuracy within 10% of the exact measure). I enjoyed the geometry mapping and study approach of this paper and highly recommend reading this mere 4-page paper.
Of course, it is a bit oversimplistic approach of the problem that is usually tackled statistically (source+noise) using moving average techniques.
Finding similarity measures between arbitary complex object pairs is by no means an easy task. Today, I'd like to quickly review the approach of Bollabas et al. on time series. Clearly, given a fixed length time series, we can encode it as point, and therefore compare two times series by using, say, the Lp norm. The problem is that time-series are prone to outliers, may have different scaling factors, etc.
The approach in
http://citeseer.ist.psu.edu/bollobas98timeseries.html
is to use the longuest common subsequence matching to define and integer similarity measure. We seek for two monotonous index sequences in the time series that match up to epsilon in value. Also we allow for some drift amount of the indices in the two series. This end-up with an O(n^6) algorithm, far too costly.
So the beauty of the paper, is to map these times series into plane and study the well-separeness property. To sets S1 and S2 are said k-separated iff they symmetric difference is of cardinality at least k. In the context of geometric line segments, one can show that the maximum size of any k-separated line segment is n^2/k^2 (and not 2^something). Geometric configurations therefore drastically reduces the size.
It turns out that with random sampling, one can build an efficient linear randomized algorithm that performs well in practice (with accuracy within 10% of the exact measure). I enjoyed the geometry mapping and study approach of this paper and highly recommend reading this mere 4-page paper.
Of course, it is a bit oversimplistic approach of the problem that is usually tackled statistically (source+noise) using moving average techniques.