Analyzing the past ... predicting the future
Last Modified: 22 July, 2001
Over the past several years, the Honeynet Project has been collecting and archiving information on blackhat activity. We have attempted, to the best of our ability, to log and capture every probe, attack, and exploit made against our Honeynet. This raw data has the potential for great value. We decided to share this data with the security community and demonstrate its value, focusing on two areas. First, we intend to demonstrate how active the blackhat community can be. Regardless of who you are, you are not safe, and our goal is to make you aware of this threat. Second, we test the concept of Early Warning and Prediction. By identifying trends and methods, it may be possible to predict an attack, and react, days before it happens. We test this theory using the data the Honeynet Project has collected.
The Collected Data
The Honeynet network, the network used to capture data, is a basic network of commonly used operating systems, such as Red Hat Linux or Windows NT, in their default configurations. No attempt was made to broadcast the identity of the Honeynet, nor to lure attackers. In theory, this site should see very little activity, as we advertise neither the systems nor any services. However, attack they do, and frequently.
What makes Honeynet data even more valuable is the reduction of both false positives and false negatives, two common problems for many organizations. A false positive occurs when an organization is alerted to malicious activity when, in fact, nothing is wrong. When organizations are repeatedly hit with false positives, they begin to ignore their alerting systems and the data they collect, making those systems potentially useless. For example, an Intrusion Detection System may email administrators an alert that a system is under attack, perhaps because a commonly known exploit was detected. However, this alert could have been mistakenly triggered by a user's email that contains a warning about known exploits and includes the source code of the attack to inform security administrators. Or perhaps network monitoring traffic, such as SNMP or ICMP, mistakenly set off the alerting mechanisms. False positives are a constant challenge for most organizations. Honeynets reduce this problem by carrying no true production traffic. A Honeynet is a network that has no real purpose other than to capture unauthorized activity. This means any packet entering or leaving a Honeynet is suspect by nature, which simplifies the data capture and analysis process and reduces false positives.
False negatives are another challenge most organizations face. A false negative is the failure to detect a truly malicious attack or unauthorized activity. Most organizations have mechanisms in place to detect attacks, such as Intrusion Detection Systems, firewall logs, system logs, and process accounting. The purpose of these tools is to detect suspicious or unauthorized activity. However, two major challenges lead to false negatives: data overload and new threats. Data overload occurs when organizations capture so much data that not all of it can be reviewed, so attacks are missed. For example, many organizations log gigabytes of firewall or system activity; it is extremely difficult to review all of this information and identify suspect behavior. The second challenge is new attacks, threats that organizations or their security software are not aware of. If the attack is unknown, how can it be detected? The Honeynet reduces false negatives (the missing of attacks) by capturing absolutely everything that enters and leaves the Honeynet. Remember, there is little or no production activity within a Honeynet, which means all of the captured activity is most likely suspect. Even if we miss the initial attack, we have still captured the activity. For example, twice a honeypot was compromised without Honeynet administrators being alerted in real time. We did not detect the successful attacks until the honeypots initiated outbound connections. Once those attempts were detected, we reviewed all of the captured activity and identified the attack, how it succeeded, and why we missed it. For research purposes, Honeynets help reduce false negatives.
The value of the data you are about to review is that both false negatives and false positives have been dramatically reduced. Keep in mind that the findings we discuss below are specific to our network; this does not mean your organization will see the same traffic patterns or behavior. We use this collected data to demonstrate the nature of certain blackhats, and the potential for Early Warning and Prediction.
Analyzing the Past
Post-attack analysis:
Predicting the Future
One of the areas the Honeynet Project intends to research is Early Warning and Prediction. It is our intent to add value to the data Honeynets collect by predicting future attacks. This theory is not new and is being pursued by several outstanding organizations. It is our hope that this research benefits and substantiates these and other organizations. Before explaining our methodology, we would first like to state that our research is still in its infancy and requires more analysis.
Now, let's qualify that statement.
In an effort to predict trends, two members of the Honeynet Project took two different approaches. However, their findings were similar: almost all attacks could be detected two to three days ahead of time.
Early warning using Statistical Process Controls (SPC):
The first was a very basic statistical analysis, similar to the statistical process control methodology used in the manufacturing world to measure defects in a factory setting. This method, although very simple, proved extremely accurate in providing short-term (three days or less) warning of impending attacks on the Honeynet. The basic process goes like this:
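In outline, a Shewhart-style control chart over honeypot traffic can be sketched as follows. This is a minimal illustration, assuming the monitored variable is a daily probe count; the numbers, and the choice of a ten-day baseline window, are hypothetical, not the project's actual data or procedure:

```python
import statistics

# Hypothetical daily counts of inbound probes against the Honeynet
# (illustrative numbers only, not the project's data).
daily_counts = [12, 9, 14, 11, 10, 13, 8, 12, 15, 11,
                9, 10, 47, 52, 13, 11, 12, 10, 14, 9]

# Establish control limits from a baseline period assumed to be
# "in control" (the first ten days here).
baseline = daily_counts[:10]
mean = statistics.mean(baseline)
sigma = statistics.stdev(baseline)

# Classic Shewhart control limits: mean +/- 3 standard deviations.
ucl = mean + 3 * sigma
lcl = max(mean - 3 * sigma, 0)  # a count cannot be negative

# Days whose activity exceeds the upper control limit are flagged as
# out of control, i.e. candidate early-warning signals.
warnings = [day for day, count in enumerate(daily_counts, start=1)
            if count > ucl]
print(f"mean={mean:.1f} UCL={ucl:.1f} flagged days={warnings}")
```

On these illustrative numbers, the chart flags the two anomalous days (13 and 14), the kind of spike that, in our data, tended to precede a compromise by a few days.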
All calculations were performed without prior knowledge of attempted or successful attacks. Only after the control chart was calculated were attempted and successful attacks plotted on the timelines. All data is available on the Honeynet site. Here are some of our findings:
The second methodology was used to validate the results of the first. We felt it would be a useful exercise to look at the relationship between Snort rpc rule violations and the number of days until system compromise. While a more proper time series model is in order, a quick, preliminary look can be had using a simple predictive regression model, regressing the frequency of rpc rule violations on the number of days until system compromise.
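A predictive regression of this kind can be sketched with ordinary least squares fit by hand. The alert frequencies and days-until-compromise values below are purely illustrative, not the project's data:

```python
# Hypothetical daily counts of Snort rpc rule violations (x) paired with
# the observed number of days until the next compromise (y).
x = [1, 2, 4, 5, 8, 10, 13, 15]   # rpc alerts per day
y = [12, 11, 9, 8, 6, 5, 3, 2]    # days until compromise

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares: slope = cov(x, y) / var(x).
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def predict(alerts):
    """Predicted days until compromise for a given alert frequency."""
    return intercept + slope * alerts

print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"12 alerts/day -> {predict(12):.1f} days until compromise")
```

The negative slope captures the intuition of the model: the more rpc rule violations observed per day, the fewer days remain until compromise.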
Figure 1 below reveals the predicted number of days until a system compromise via rpc.statd from this model. The horizontal axis represents the date, in days during the sample period, from 1 to 180. Downward spikes indicate significant activity, predicting an impending attack. This activity is visible about 10 days before the actual compromise occurs on day 68. Again, there are three downward "threat spikes" near the end of the chart before the system is again compromised by the same rpc attack on day 177. We have not yet confirmed what the upward spikes are; preliminary analysis suggests they represent 'quiet time', or relatively safe periods.
While it should be cautioned that there are some statistical problems with the model - including a large Durbin-Watson statistic suggesting that there is some serious autocorrelation that needs to be removed from the model - preliminary examination suggests that there are methods to warn of an impending attack several days before it happens. A more sophisticated time series analysis of this data, in conjunction with other data, would be most useful in further supporting the idea of early warning.
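The Durbin-Watson check mentioned above is straightforward to compute from the regression residuals. A minimal sketch, using an illustrative residual series rather than the model's actual residuals:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic on a residual series: values near 2 mean no
    first-order autocorrelation, values well below 2 suggest positive
    autocorrelation, and values well above 2 suggest negative
    autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Illustrative residuals with alternating signs (negative autocorrelation),
# which pushes the statistic well above 2.
residuals = [1.0, -0.9, 1.1, -1.0, 0.8, -1.2, 0.9, -0.8]
print(f"DW = {durbin_watson(residuals):.2f}")
```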
Examining Characteristics of Pre-Attack Signals using an ARIMA Model
Another area of investigation is to discern the characteristics of certain types of attacks and probes. This second example comes from one of the Honeynet Project's "Scans of the Month". The graph below portrays the frequency of port scans over a 30-day period, the month of November. One of the questions we would like to answer for various probes and pre-attack behaviors is, "What is the typical period of time within which either an attack, further probing, or a cessation of activities might be observed?" In this case, a simple time series ARIMA (Autoregressive Integrated Moving Average) model was fitted to the data. ARIMA is a basic model used in time series analysis for examining data collected over a period of time.
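Full ARIMA estimation is normally done with a statistics package. As a minimal, self-contained illustration of the autoregressive piece of such a model, the sketch below fits an AR(1) model - each day's scan count regressed on the previous day's - to a hypothetical series of daily port-scan counts (not the November data):

```python
# Hypothetical daily port-scan counts over 15 days (illustrative only).
scans = [5, 7, 6, 9, 8, 12, 11, 15, 14, 18, 17, 22, 20, 26, 24]

# Fit x_t = c + phi * x_{t-1} by ordinary least squares: the AR(1) model,
# the simplest autoregressive building block of an ARIMA model.
prev, curr = scans[:-1], scans[1:]
n = len(prev)
mx = sum(prev) / n
my = sum(curr) / n
phi = (sum((p - mx) * (q - my) for p, q in zip(prev, curr))
       / sum((p - mx) ** 2 for p in prev))
c = my - phi * mx

# One-step-ahead forecast for the next day's scan count.
forecast = c + phi * scans[-1]
print(f"phi={phi:.2f} c={c:.2f} next-day forecast={forecast:.1f}")
```

A phi close to 1 indicates strongly persistent activity; in an early-warning setting, a sustained upward run in the forecasts plays the same role as the out-of-control points on the control chart above.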
We encourage the security community to test and develop these theories and to perform their own statistical analysis. We are especially interested in any other types of analysis or findings people may produce. What we have presented here is by no means an exhaustive analysis; rather, it represents preliminary research. Linked below is the data collected and used by the Honeynet Project. This data represents eleven months of activity, collected from April 2000 to February 2001. honeynet_data.tar.gz
During an eleven-month period, the Honeynet Project attempted to collect every probe, attack, and exploit sent against it. This data was then analyzed with two goals in mind. The first goal was to demonstrate just how active the blackhat community can be. The numbers demonstrate the hostile threat we all face. Remember, the Honeynet used to collect this information had no production systems of value, nor was it advertised to lure attackers. If your organization has any value, or is advertised in any way, you are most likely exposed to an even greater threat. The second goal was to test the theory of Early Warning and Prediction. We feel there is potential in predicting future attacks. Honeynets are by no means the only method of collecting such data; however, they have the advantage of reducing both false positives and false negatives. Armed with data collection and statistical analysis, there is the potential for organizations to be better prepared against the blackhat community.