Discussion:
Detecting Anomalies in Events
CMOS
2005-09-19 11:08:53 UTC
hi all,
i need to build a system which will find anomalies in a particular
activity. The system will be fed with events as they happen and
it should be able to find any significant deviation from normal
operation. the activities involved are diverse, so if i can come up
with a generic system, i might be able to use it on all occasions. Some
examples of the activities i need to monitor are:
1 ) webserver access
2 ) sales
3 ) access to a particular table in a database
4 ) access patterns of various tables in a database, etc
5 ) access ( login ) patterns, frequency to a particular system

this list will be growing.

basically i need to be alerted on

deviation of events' frequency from normal
deviation in the pattern of events that are happening, etc

i wonder whether there is any area in mathematics / computer science
which deals with this kind of problem. i would really appreciate it if
someone could suggest a good path for the project.

thank You.
CMOS

Kheor
2005-09-20 15:12:39 UTC
SOM (self-organizing map) networks allow you to detect this kind of
outlier on a map. In fact, we use SOMs in fraud detection quite often.
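Here is a minimal sketch of the idea in Python (illustrative only; the
grid size, training schedule, and 1% cut-off are my own assumptions,
not anything from this thread). Train a small SOM on the event features
and flag the points whose quantization error, i.e. the distance to
their best-matching unit, is unusually large:

import numpy as np

def train_som(data, grid=(10, 10), iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small self-organizing map; returns the grid of prototype vectors."""
    rng = np.random.default_rng(seed)
    h, w = grid
    dim = data.shape[1]
    # Initialize prototypes from random training points.
    weights = data[rng.choice(len(data), h * w)].reshape(h, w, dim).astype(float)
    # Grid coordinates, used to compute neighborhood distances on the map.
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        # Best-matching unit: the node whose prototype is closest to x.
        bmu = np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=-1)), (h, w))
        # Linearly decaying learning rate and neighborhood radius.
        frac = 1.0 - t / iters
        lr, sigma = lr0 * frac, max(sigma0 * frac, 0.5)
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
        influence = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
        weights += lr * influence * (x - weights)
    return weights

def quantization_errors(weights, data):
    """Distance from each point to its best-matching unit."""
    flat = weights.reshape(-1, weights.shape[-1])
    return np.min(np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=-1), axis=1)

# Usage: flag the points that map worst onto the learned grid.
events = np.random.default_rng(1).normal(size=(500, 2))   # stand-in for real event features
errs = quantization_errors(train_som(events), events)
outliers = np.where(errs > np.quantile(errs, 0.99))[0]    # top 1% as candidate anomalies

Anything that lands far from every learned prototype is a candidate
anomaly; the cut-off should of course be tuned to the false alarm rate
you can live with.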

Hope that helps.
PS: read up on unsupervised learning in the FAQ.

--CELKO--
2005-09-20 15:13:00 UTC
Look up products like CoreMetrics that collect website data and see
what they have.

Ted Dunning
2005-09-22 08:53:03 UTC
The essence of the problem is that you have a classification problem
with training examples for all but the class of interest.

Viewed this way, the problem reduces to building probability models for
known cases and the unknown case. Since you have little or no data for
the unknown (anomalous) case, you have to make some strong assumptions.
Ultimately, you may be able to collect putative data from the
anomalous case, but you really can't depend on that being possible. At
most, you only get enough data to slightly constrain the posterior
distribution of the anomalous case.

Take for example the ultimately simple case of normally distributed
events with normally distributed anomalies. You know (in this example)
that the probability of an anomalous event is less than 1%.

Let us assume that from domain knowledge, you know that the mean of the
distributions is probably less than 10 and the standard deviation is
less than about 10. It is pretty easy to come up with conjugate prior
distributions that encode this knowledge.

If you take some number of training examples, then you can get a pretty
good posterior distribution for the non-anomalous case since you know
that at most about 1% of the training examples will be anomalies. In
fact, you can train a Gaussian mixture model on your data to get an even
better model. The model for the anomalous case will be largely
undetermined by the data and thus will be dominated by the prior
distribution.

In this framework, it is pretty easy to get a posterior probability
that each new data point is an anomaly or not (especially given that
each point must be one of the two) by integrating over all possible
parameter values. Obviously, these posterior estimates depend pretty
critically on the prior distribution of the anomalies. The wider you
choose the prior to be, the more extraordinary a point must be before
considering it to be an anomaly. The good news is that with a
Gaussian, your tails drop so sharply, you will do pretty well once you
have seen a few anomalies.
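For concreteness, a much-simplified Python sketch of that two-class
setup (my illustration only: it uses plug-in Gaussian estimates for the
non-anomalous class rather than integrating over a conjugate posterior,
a deliberately wide Gaussian as a stand-in for the prior-dominated
anomalous class, and made-up numbers throughout):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
observed = rng.normal(loc=5.0, scale=2.0, size=1000)   # training events, ~99% normal

# Non-anomalous class: estimated from the bulk of the data.
mu_hat, sd_hat = observed.mean(), observed.std(ddof=1)

# Anomalous class: wide, reflecting only the vague domain knowledge
# (mean less than about 10, standard deviation less than about 10).
mu_anom, sd_anom = 0.0, 10.0
p_anom = 0.01                                          # prior probability of an anomaly

def posterior_anomaly(x):
    """P(anomaly | x) via Bayes' rule over the two classes."""
    like_norm = norm.pdf(x, mu_hat, sd_hat) * (1 - p_anom)
    like_anom = norm.pdf(x, mu_anom, sd_anom) * p_anom
    return like_anom / (like_norm + like_anom)

print(posterior_anomaly(5.0))    # near the bulk -> anomaly probability near 0
print(posterior_anomaly(30.0))   # far in the tail -> anomaly probability near 1

The same machinery extends to the proper conjugate treatment (for
example normal-inverse-gamma priors with Student-t predictives) or to a
mixture model for the non-anomalous class.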

A previous poster claimed to use SOM's for this sort of problem. SOM's
may be a mildly interesting way to estimate complex probability
distributions, but I have found that actually analyzing your problem more
carefully generally leads to better solutions.

I would be very interested to hear of specific examples of production
systems that actually use SOM's for fraud detection. None of the
systems that I have designed for that task, nor any of the ones that I
am familiar with actually use SOM's for fraud detection. Anomaly
detection is an important step in these systems because fraud is so
commonly under-reported in the real world, but I haven't seen any SOM's
used in anger in these systems.

I would love to hear otherwise.

f***@gmail.com
2005-09-26 10:45:45 UTC
I think looking at how policy-based systems operate might help you. In
such systems, policies are sets of rules that control the behavior of
the system. They can be written either in a formal logical language or
in some semi-formal (XML) language that is eventually implementable. In
that way one can write rules that define abnormal situations, and the
system can fire customized actions accordingly.
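A toy illustration of that idea in Python (the rule names, event fields,
and actions below are made up for the example):

from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Rule:
    name: str
    condition: Callable[[Dict[str, Any]], bool]   # returns True for abnormal events
    action: Callable[[Dict[str, Any]], None]      # fired when the condition holds

def alert(event):
    print(f"ALERT: {event}")

rules = [
    Rule("too-many-failed-logins",
         lambda e: e.get("type") == "login" and e.get("failures", 0) > 5,
         alert),
    Rule("off-hours-table-access",
         lambda e: e.get("type") == "db_access" and not (8 <= e.get("hour", 12) <= 18),
         alert),
]

def process(event, rules):
    for rule in rules:
        if rule.condition(event):
            rule.action(event)

process({"type": "login", "failures": 9}, rules)     # fires too-many-failed-logins
process({"type": "db_access", "hour": 3}, rules)     # fires off-hours-table-access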

Ted Dunning
2005-09-28 01:47:05 UTC
The suggestion to hand-write rules to define abnormal situations pretty
much describes the state of the art in fraud detection 15 years ago.

Adaptive techniques, notably including anomaly detection in the first
few generations, completely destroyed hand-written systems, especially
in credit card fraud. The problem isn't so much that the hand-written
systems don't work initially, but that they decay over time. My term
for the phenomenon is bit-rot. As new rules are added and as the world
changes, the hand-written rule systems eventually produce really,
really bad results.

That said, rules do have their place. The feature detectors that go
into adaptive systems are definitely rules of a sort (but they don't
themselves predict fraud very well), and the business logic that
determines what actions to take is also a set of rules. Rules are a
good fit for regulatory requirements (as in, don't call a customer
about suspected fraud more than once in any 90-day period). Adaptive
learning methods work much better at balancing the risk of error and
the value of detection.

lemon
2005-09-29 00:41:47 UTC
Hope you can get some insight from this one... they basically classify
sequences of visited web pages into buyer/non-buyer, using a SOM. The
input for the SOM is a matrix expressing the sequence.

"M-SOM-ART: Growing Self Organizing Map for sequence clustering and
classification"
Zehraoui, F., Bennani, Y.
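For what it's worth, one common way to express a visit sequence as a
fixed-size matrix is a transition-count matrix over a fixed page
vocabulary, sketched below. Whether that matches the encoding actually
used in the M-SOM-ART paper I can't say, so treat it purely as an
illustrative assumption:

import numpy as np

PAGES = ["home", "product", "cart", "checkout", "help"]
INDEX = {p: i for i, p in enumerate(PAGES)}

def sequence_to_matrix(visits):
    """Count page-to-page transitions, normalized so sessions of different lengths compare."""
    m = np.zeros((len(PAGES), len(PAGES)))
    for src, dst in zip(visits, visits[1:]):
        m[INDEX[src], INDEX[dst]] += 1
    total = m.sum()
    return m / total if total else m

session = ["home", "product", "product", "cart", "checkout"]
features = sequence_to_matrix(session).flatten()   # flattened matrix as the SOM input vector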

Regards

Kirt Undercoffer
2005-10-18 03:14:41 UTC
Post by CMOS
hi all,
i need to build a system which will find anomalies in a particular
activity. The system will be fed with events as they happen and
it should be able to find any significant deviation from normal
operation. ...
1 ) webserver access
...
Post by CMOS
5 ) access ( login ) patterns, frequency to a particular system
...
basically i need to be alerted on
deviation of events' frequency from normal
deviation in the pattern of events that are happening, etc
i wonder whether there is any area in mathematics / computer science
which deals with this kind of problem. i would really appreciate it if
someone could suggest a good path for the project.
You have made a mistake common to this kind of problem.

Namely, you are assuming that deviations from an observed norm are
errors.

With the domains listed you can get away with this. But there are
domains where this assumption is false. Specifically, this
assumption does not necessarily hold for computer log file analysis, or
in fact for any domain in which misconfigurations and/or malfunctions
(i.e. bugs requiring patches or hotfixes applied to a large number of
systems [like the endless stream of hotfixes that Microsoft was putting
out before they started staging their hotfixes as upgrades]) are very
common and can be seen as normal. In fact, with such systems the
outliers can actually be the correctly configured systems! In these
sorts of domains the problem is compounded because there generally
isn't a universal baseline that can be established, since different
systems have differing configurations to meet different needs in
different environments.

Regarding even the domains listed, access is handled best by policy, and
policy isn't always reflected in actual access patterns. If you have
a large number of users and there has been a long-term undetected
intrusion, then simply looking for outliers alone will not detect the
intrusion, because the currently undetected intrusion is already
established as the norm (an extreme but possible condition). Outlier
analysis like this can only detect new intrusions. So consider adding
policy checking.
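A small sketch of that suggestion in Python (the policy predicate and
the score threshold are placeholders of my own): check explicit policy
violations alongside the statistical outlier score, so an access the
model has long since learned to treat as "normal" is still flagged when
it breaks policy.

def violates_policy(event):
    # Example policy: only the listed roles may read this table.
    return event["table"] == "salaries" and event["role"] not in {"hr", "payroll"}

def review_needed(event, anomaly_score, threshold=0.99):
    # Flag on either a high statistical anomaly score or an explicit policy violation.
    return anomaly_score > threshold or violates_policy(event)

# A long-running unauthorized reader looks statistically normal (low score),
# but the policy check still flags it.
print(review_needed({"table": "salaries", "role": "intern"}, anomaly_score=0.10))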

BTW - these issues came up in a real-life system I was working on, where
people from a major defense contractor made the same naive assumption.

Kirt Undercoffer

Kirt Undercoffer
2005-10-18 03:15:09 UTC
Take a look at :

James H. Andrews and Yingjun Zhang, General Test Result Checking with
Log File Analysis, IEEE Transactions on Software Engineering, v. 29,
no. 7, July 2003, pp. 634-648.

http://www.csd.uwo.ca/faculty/andrews/papers/index.html

Now this isn't going to be appropriate for most of the domains you have
listed, but it can be useful when dealing with problems that involve some
kind of log analysis (and several of the problems you have listed can
be addressed through log analysis).

Kirt Undercoffer

Ted Dunning
2005-10-24 10:50:19 UTC
I re-read the original posting and realized that I (and the other
posters) had missed the fact that all of the examples were cases of
time-embedded events with non-constant frequency.

As it turns out, I have done a fair bit of work on this and should have
been able to give a much better answer.

The simple and practical answer is to view each type of event as
a Poisson process with a non-linear time warp. It is very easy to
build alarms for Poisson processes since you can build a model that has
specified false alarm/missed event rates just by looking at the delay
since the last event. Some preprocessing may be necessary if accesses
from a single source are clustered, as would often be the case with a
database.

Finding the correct time warp is as simple as estimating the average
rate of events. For web sales, this is as easy as building a model
with time of day, day of week, and a holiday flag. Day of week is often
represented simply as a weekend flag. Generalized linear models are
really the right tool for this, but you can just compute hourly average
rates for each kind of day and be pretty much in business (with a
little bit of linear smoothing). I have built activity-level alarms
based on this approach that worked very well indeed.
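Here is a rough Python sketch of that scheme (illustrative only: the
rate table, the weekend flag, and the per-gap false alarm rate are
made-up assumptions; a real system would estimate the rates from
history, e.g. with a GLM). Wall-clock time is warped into expected
event counts, and the alarm fires when the warped gap since the last
event is improbably long for a unit-rate Poisson process:

import math
from datetime import datetime, timedelta

# Hourly event rates (events per hour) keyed by (hour, is_weekend);
# made-up numbers standing in for estimates from historical data.
rates = {(h, wk): (5.0 if 9 <= h <= 17 and not wk else 0.5)
         for h in range(24) for wk in (False, True)}

def expected_events(start: datetime, end: datetime) -> float:
    """Warped time: expected number of events between two wall-clock times."""
    total, t = 0.0, start
    while t < end:
        hour_end = t.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        step_end = min(end, hour_end)
        frac_hour = (step_end - t).total_seconds() / 3600.0
        total += rates[(t.hour, t.weekday() >= 5)] * frac_hour
        t = step_end
    return total

def too_quiet(last_event: datetime, now: datetime, false_alarm=0.001) -> bool:
    """Alarm when the warped gap since the last event is improbably long.

    Under the Poisson model the warped gap is Exponential(1), so a gap
    longer than -ln(false_alarm) has probability false_alarm during
    normal operation.
    """
    return expected_events(last_event, now) > -math.log(false_alarm)

# Usage: three silent business hours on a weekday morning is suspicious.
print(too_quiet(datetime(2005, 9, 26, 9, 0), datetime(2005, 9, 26, 12, 0)))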

Sorry for being dense.
