There are many ideas about what to do when a system compromise is detected. There is no easy answer; as with all types of contingencies, the response depends upon the situation. Risk management is the key to conducting an effective response, and that response must be both quick and effective. How effective you can be, though, will depend upon your organization’s understanding of the risk(s) involved.
Computer Security Incident Response Programs (CSIRP) and Teams (CSIRT) are functions of Enterprise Risk Management tasked with IT-related response planning, testing, and execution of response activities and investigations. They are usually a component of the IT organization but often support Legal or other Risk Management functions. Enterprise IT will generally have a computer security risk management policy that says something like – “assess the compromise type and take action accordingly”. In more mature contingency management programs (sometimes part of CSIRP but more commonly part of BCP/DR programs) a risk evaluation criterion will be used to differentiate risks from threats.
In the public sector that essentially means differentiating between which activities can be prevented, versus what activity should be captured and investigated. In the private sector the risk threshold is different, in effect relating “cyber-security risk management” to commercial considerations of what will affect the business – health and safety, financial, competition, brand, legal, and operational issues. Both public and private sector organizations are concerned with these issues but a fundamental difference is the ability of the organization to investigate – in terms of resources, methods, and legal jurisdiction to do so.
The process and methods of investigating a cyber-security breach incident are very dependent upon accessibility to related systems. When you have physical access to a computer it is relatively simple (technically) to collect the artifacts necessary to conduct an analysis or investigation; however it is more complex in terms of legal considerations. “Stateful” and volatile data are important, but whether or not forensic quality is required is a risk management policy issue of the organization that should guide the organization’s approach. The policy should of course be informed by “best practices”, but those include legal, operational, AND technical best practices - particularly when there is a question of HR or other legal ramifications.
Figure 1: Risk & Contingency Management Focus
But what if you don’t have physical access? And what if you don’t even know which specific computer is exhibiting suspicious behavior(s)? How do you address that, and what costs are involved – including capital and operational expenses of the technical response, but also business interruption, and jurisdictional or other legal costs (such as data transfer rules for privacy in different regions)? This is the situation that most organizations are faced with. An economic, but effective and scalable, approach to Cyber Security Incident Response is needed. Our approach offers a framework for related contingency management policy in an organization. We call it “Presponse”.
We should start with a philosophical statement. An indicator of compromise is not a “threat” indicator. It is a “risk” indicator. It can certainly indicate the use and utility of an application or a system that can threaten a business, but the actual threat is determined by when and how it is used and what impacts are possible or demonstrated. In effect, every indicator has some risk and some correlated threat associated. As analysts we can provide information related to our interpretation of available evidence and our experience with similar events to help a customer interpret where on the risk-threat continuum an indicator may be relevant; and in doing so help to identify other related activities to discover or hopefully impede further compromise activities.
When some indicators of compromise are uncovered in a cyber-security event it is fundamentally important to understand the scope of suspicious behaviors related to the activities. Simply responding to unique indicators is not sufficient to conduct a risk analysis and identify suspicious activities. A risk analysis though is crucially important to organizations in order to determine the appropriate structure, communications, and response to the situation. Indicators of compromise can lead to a computer or perhaps several, but from there it gets more complicated – and can be accordingly more expensive if an efficient framework for discovery and investigation is not utilized.
There have traditionally been two divergent CSIRT approaches: 1) detect, respond, and resolve; or 2) detect, investigate, plan, respond, and remediate. The missing component in those traditional approaches though is a suitable diagnosis, or determination of the scope of compromise and interpretation of the impact. This is an intrinsic concept in organizational risk management (for financial, brand, legal or other impacts) – but a differentiating feature of CSIRP versus other types of contingency management activities that we have observed. Sometimes the urge to “stop the bleeding” overtakes the methods of risk management policy. With an adaptive and efficient framework of response however, it is possible to do both – in fact the military has demonstrated this for hundreds of years as a battlefield medical response procedure called “triage” where primary (critical), secondary (impactful), and tertiary (remedial) cases are handled according to assessed priorities of care.
Triage depends upon a systemic awareness of related conditions and factors to diagnose:
- Those for whom care may make a positive difference in the outcome
- Those who are likely to live, regardless of the care they receive
- Those who are likely to die, regardless of the care they receive
In CSIRT contingency management activities Triage can be interpreted to mean:
- Identify those systems that demonstrate specific indicators of threat behavior (such as sabotage, data loss, or user profile abuse) and that may reveal information useful to impeding or preventing further compromise
- Identify those systems that exhibit other attributes or traits of compromise (such as Trojan dropper/downloaders, backdoors, or botnets)
- Identify those systems that have other anomalies or may exhibit risk factors that could be used to compromise the environment in the future (such as general malware, or build discrepancies)
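The three triage tiers above can be sketched as a simple classification. This is a minimal illustration, not part of the methodology as published: the indicator labels, the `Tier` names, and the `triage` and `prioritize` functions are all hypothetical, and a real assessment would rest on analyst judgment rather than set membership.

```python
from enum import IntEnum


class Tier(IntEnum):
    PRIMARY = 1      # specific indicators of threat behavior
    SECONDARY = 2    # other attributes or traits of compromise
    TERTIARY = 3     # anomalies or risk factors for future compromise
    UNCLASSIFIED = 4


# Hypothetical indicator labels, taken from the examples in the list above.
THREAT_BEHAVIORS = {"sabotage", "data_loss", "profile_abuse"}
COMPROMISE_TRAITS = {"dropper", "backdoor", "botnet"}
RISK_FACTORS = {"general_malware", "build_discrepancy"}


def triage(indicators):
    """Return the most urgent tier that any observed indicator falls into."""
    found = set(indicators)
    if found & THREAT_BEHAVIORS:
        return Tier.PRIMARY
    if found & COMPROMISE_TRAITS:
        return Tier.SECONDARY
    if found & RISK_FACTORS:
        return Tier.TERTIARY
    return Tier.UNCLASSIFIED


def prioritize(hosts):
    """Order hosts by tier (most urgent first), then by name for stability."""
    ranked = ((name, triage(found)) for name, found in hosts.items())
    return sorted(ranked, key=lambda pair: (pair[1], pair[0]))
```

A host that shows both a backdoor and evidence of data loss is placed in the primary tier, because the most urgent matching tier wins.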
In order to diagnose an incident, it is necessary to have an understanding of the potential scope of compromise. This is accomplished by using available information to identify “systems of interest” based upon behavioral abnormalities. For example, if a suspect FQDN (such as “apt.dyndns.com”) is known then assessing the scope can be as simple as reviewing DNS logs to identify systems that have requested resolution of that name; however, the review should be more inclusive. At a minimum it should identify the IP addresses related to that FQDN and the correlated ASN-registered IP addresses, and review Firewall, Proxy, and HIPS/NIPS logs for similar indicators. Often the identification of announcing endpoint hosts is stymied by DHCP or internal routes that obfuscate the actual host address, though, which means that host information needs to be collected and examined as well for the diagnosis to be effective.
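As a rough illustration of the DNS log review described above, the sketch below finds clients that queried a suspect FQDN and then maps each client address back to a host through DHCP lease records. The log layout (`timestamp client_ip queried_name`), the lease tuple layout, and the function names are assumptions made for the example; real DNS and DHCP logs differ by product and would need their own parsers. Timestamps are assumed to be ISO-8601 strings in one consistent format and time zone, so they can be compared as text.

```python
from collections import defaultdict


def clients_querying(log_lines, suspect_fqdn):
    """Map each client IP to the timestamps at which it queried the suspect
    name or any subdomain of it. Malformed lines are skipped."""
    suspect = suspect_fqdn.lower().rstrip(".")
    hits = defaultdict(list)
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # not in the assumed three-field layout
        timestamp, client_ip, queried = parts
        name = queried.lower().rstrip(".")
        if name == suspect or name.endswith("." + suspect):
            hits[client_ip].append(timestamp)
    return dict(hits)


def host_at(client_ip, timestamp, leases):
    """Return the hostname that held client_ip at the given time, or None.

    leases is an iterable of (ip, lease_start, lease_end, hostname) tuples;
    the lease end is treated as exclusive.
    """
    for ip, start, end, hostname in leases:
        if ip == client_ip and start <= timestamp < end:
            return hostname
    return None
```

Resolving each hit through the lease history matters because the same address may belong to different machines on different days, which is the DHCP obstacle noted above.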
There's More To It Than Malware
In any incident response, communications are the key to effective incident handling. Triage is an important component of that effort – identify what information is most important and communicate it to the appropriate levels for action.
Incident responders are often influenced by information that has sensational value: beaconing services that do not conform to standard builds, SQL injection strings in Web server access logs, virus detections by A/V or A/M, etc. More often they are given pieces of information by users, management, or external third parties (such as vendors or law enforcement) concerning anomalies that are believed to represent a current “threat”. These are important data points in a risk management program, or in an incident response, but a sense of organization in the effort is crucial to developing an efficient workflow in the incident handling procedures – and to support communications of the appropriate information to the appropriate organizations for handling. Context thus becomes important.
A risk is not the same thing as a threat. A risk is something that could happen based upon circumstances or opportunity; a threat is something that has or could have a demonstrated impact on your business or operations. This is a simple philosophy of risk management. Don’t think that the sky is falling just because of a raindrop.
Cyber security risks include things like poor ACL configurations, unpatched systems, zero day exploitations of software vulnerabilities, dropper/downloaders, backdoor Trojan RATs, potentially unwanted applications, worms, unrestricted or liberal user access rights, redundant administrative tools, undocumented systems, and similar weaknesses. Cyber security threats on the other hand include situations like data theft, sabotage, fraud (identity or financial), user profile abuse, privilege escalation, lateral movement, system reconfiguration, subversion, and similar activities. Different parts of the organization are tasked with dealing with risks and threats, so it is important to differentiate them.
When addressing cyber issues, whether for an incident response or for a “health check” it is useful to have a frame of reference. That will guide the activities of the responders and provide objectives for communication, and distinguish results for communicating risks that can be handled by operational functions of the organization – from threats to be handled by legal, executive, or other functions.
Unfortunately most security practitioners begin with one of the least important aspects of cyber issues – malware. Malware is too often identified as a “threat” when in reality it is merely a risk, and usually a risk related to past activities. Most persistent actors move on from using malware to making use of applications already configured in the environment, or to creating new utilities from the existing environment (like proxies or web drops on reconfigured servers, or AD and VPN accounts). Some malware is used periodically as a backdoor for administrative purposes, but history has shown us conclusively that malware is merely a tool in an arsenal of tools that mostly already exist in the computers they access and manipulate. Malware can be a useful identifier (or indicator of compromise), but it will reflect past risks more than current threats. Malware doesn’t matter as much as data loss, or the ongoing use of credentials, or unrestricted lateral access between systems, or ingress/egress via VPN or web services, or other activities that manipulate the legitimate services and architecture of the organization.
When communicating risks and threats in a cyber security incident (or health check), prioritize the activities that are more indicative of threats over the indicators of risks. Focus on the activities that have the most potential to impact the organization. Help manage risks by understanding and distinguishing between risks and threats.
Malware does matter, but not so much. It is one contributing factor of many, and has reduced importance over time as aggressors integrate their activities for purpose. The following diagram provides a recommended approach to communicating 5 priority factors in incident response or the assessment of the cyber security health of an organization.
Figure 2: Organizational Priorities of Diagnostic Assessment for CSIRT Contingency Management
A focus on the following 5 particular areas of priority importance to security incident response (as described in the previous diagram) will help provide context in communications:
- Data Exfiltration or Sabotage – has any information been targeted, or are there indications of harmful programs or changes that might cause business interruptions?
- User Profile Propagation – what user profiles were created on how many computers and when? This may indicate malicious use of credentials by insiders or intruders.
- Lateral Movement – what methods and credentials were used to access which computers? This may indicate patterns of activity that could reveal threat activities.
- Malware Indicators and Attributes – is any known or recognizable malware present in the active filesystem, processes, services, communications, or scheduled tasks? These are identified by the characteristic differences of exploits, downloaders, proxies, and RATs – and may include non-malicious “PUPs” (potentially unwanted programs) as well as the use of native or common administrative utilities.
- Build/Configuration Inconsistencies – are Operating System builds and versions consistent; are application versions consistent, particularly those that are vulnerable to zero- or single-day exploits?
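The last item in the list, build consistency, lends itself to a simple check. The sketch below treats the most common build string in an inventory as the baseline and reports every host that deviates from it. The inventory shape (hostname mapped to a build string) and the function name are assumptions for illustration; an organization with several sanctioned builds would compare against an approved list instead.

```python
from collections import Counter


def build_outliers(inventory):
    """Return (baseline_build, {host: build}) for hosts that deviate from
    the most common build. inventory maps hostname -> build string."""
    if not inventory:
        return None, {}
    # The most frequently seen build is assumed to be the standard one.
    baseline, _count = Counter(inventory.values()).most_common(1)[0]
    deviating = {host: build for host, build in inventory.items()
                 if build != baseline}
    return baseline, deviating
```

A deviation is only a risk indicator under the framework described here: it marks a system worth a closer look, not a confirmed compromise.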
Ironically, a lot of press and attention has been focused on malware as the indicator of “Advanced Persistent Threat” in the past several years. In fact malware is only one indicator of those activities. The most important thing to an organization in terms of cyber security is data loss. That should be looked for first, and is a much better indicator of compromise than itinerant programs on discrete systems. Similarly, the most accurate, and much easier to detect, indicator of APT is the unauthorized or anomalous use of valid credentials in the enterprise.
Fundamentally every user will have a unique profile, and usually a single system (or at most a few) on which that profile is meant to be used. When a profile is used on other systems, that use should be flagged as questionable activity and reviewed. It seems intuitive that user profiles will exist on many systems, and when desktops were shared that was indeed the case – but most computers are single-user systems today, so these types of anomalous access are very useful in identifying a history of illegitimate use.
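The profile check described above can be sketched as follows. The event shape (pairs of user and host, for example drawn from logon records or profile directories), the `expected_hosts` parameter, and the function name are assumptions for illustration; the sketch simply treats each user's most frequently used host or hosts as normal and flags the rest for review.

```python
from collections import Counter, defaultdict


def unexpected_profile_use(events, expected_hosts=1):
    """Flag hosts where a user profile appears outside that user's usual
    systems. events is an iterable of (user, host) pairs.

    Returns {user: [flagged hosts, sorted]} for users with any flagged host.
    """
    per_user = defaultdict(Counter)
    for user, host in events:
        per_user[user][host] += 1

    flagged = {}
    for user, counts in per_user.items():
        # The user's most frequently seen host(s) are assumed legitimate.
        usual = {host for host, _ in counts.most_common(expected_hosts)}
        others = sorted(set(counts) - usual)
        if others:
            flagged[user] = others
    return flagged
```

Administrative and service accounts legitimately touch many systems, so in practice they would need a larger `expected_hosts` value or separate handling.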
The items described above are generalized as an approach. In fact circumstances will dictate which indicators an IR team may be faced with first, but having a framework to contextualize threats vs. risks can help keep focus for communications’ sake and allow effective and efficient efforts. In football sometimes you have to drop back in order to create space to gain yards on a play.
Enough is Enough
CSIRT information collection has traditionally been focused upon forensic data – computer disk images, network logs, transaction logs, and similar. The most common enterprise/organizational software used to collect forensic information from endpoint hosts comes from Guidance Software and AccessData. Their products, EnCase and FTK respectively, have become the standard forensic collection and analysis tools. Local collection with those tools requires some technical awareness and experience to use the software. Both also have network-capable products to collect remote disk copies, but that can be expensive. The expense is not only financial, but also operational. To capture a remote disk copy a suitable network connection is needed to support the collection; however, just as important (and more commonly the obstacle faced by private sector companies operating internationally) are the privacy laws and restrictions concerning transfer of personal or non-public information. Data collection from remote systems is not simply a technical issue. It is important to collect information from systems for triage, but the process should not create more obstacles (such as legal ones) for the responders or the organization.
We have learned from experience that there is a suitable method of investigation that can be performed that will both collect sufficient evidence, and protect it for legal purposes if needed. It can also be accomplished simply and at very low operational cost (think MB instead of GB of data collected remotely with no PII/NPI). The method is scalable, and indeed most effective with more endpoint host data than less – meaning it encourages enterprise host inclusion as an efficiency gain rather than excluding extended systems – to achieve economies of scale. EnCase and FTK are excellent products and an important tool in the CSIRT kit, but as with any expensive tools they should be limited according to resource constraints in an economic balance of the cost versus benefit of use.
To explain a more efficient framework it may be useful first to describe the activities of CSIRT contingency management. Our Presponse methodology provides an example framework:
Figure 4: Cylance Presponse CSIRT-CM Framework
At the outset of a cyber-security incident there is limited understanding of how many systems are involved. The scope of the investigation is accordingly the whole IT environment. Through effective diagnosis and assessment of resulting data, a practical understanding can be gained that will coincidentally increase the coverage of the IT environment while decreasing the number of systems (and consequently the resources and related costs) actually related to the scope of the investigation. This dynamic coincidence creates an economy of scale whereby the least expensive methods and tools are used first, and the most expensive resources and sophisticated tools are used last - and most efficiently. Each “phase” of activities in the framework provides indicators of compromise as attributes of suspicious behavior that are correlated to prior phases’ information sets in order to comprehensively cover the environment in the most efficient method possible.
A workflow is useful to describe the utility of information in the process of auditing or assessing information security issues. Whether the activity is initiated for a “health check” or an incident response the workflow is generally the same:
Figure 5: Presponse Workflow
The workflow demonstrates the interactivity of information as it is discovered in each phase of the CSIRT investigative activities; as new information is learned it is fed back into the workflow. Sometimes information is provided that can initiate the discovery process, sometimes the information is discovered as a result of the process.
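The feedback behavior of the workflow can be sketched as a loop that runs until nothing new is learned: known indicators identify hosts, those hosts reveal further indicators, and the new indicators are fed back in. This is an illustrative sketch only; the observation format (pairs of host and indicator) and the function name are assumptions, and the observations are assumed to contain only attributes an analyst already considers suspicious, since feeding in attributes common to healthy systems would pull the whole environment into scope.

```python
def expand_scope(seed_indicators, observations):
    """Iteratively grow the set of in-scope hosts and known indicators.

    observations is a list of (host, indicator) pairs meaning that the host
    exhibits the indicator. Returns (hosts, indicators) once a full pass
    over the observations adds nothing new.
    """
    indicators = set(seed_indicators)
    hosts = set()
    changed = True
    while changed:
        changed = False
        for host, indicator in observations:
            # A known indicator brings its host into scope.
            if indicator in indicators and host not in hosts:
                hosts.add(host)
                changed = True
            # An in-scope host contributes its other suspicious attributes.
            if host in hosts and indicator not in indicators:
                indicators.add(indicator)
                changed = True
    return hosts, indicators
```

Starting from a single externally reported address, for instance, the loop can surface a second host that never contacted that address but shares a suspicious service with the first.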
For example, in several recent cases the Department of Homeland Security or the FBI have notified our clients of malicious communications emanating from certain hosts in their networks. Some accompanying information was provided that allowed examination of network logs to identify additional hosts, and coincidental examination of specific hosts. We used that information to conduct audits of the IT environment with Splunk, Change/Configuration Management tools like Altiris / SMSC / SCCM, etc., or custom scripts that allowed interrogation of as many systems in the organization as were available. Then by correlating data points related to behavior and use of the systems we were able to identify the subset of “systems of interest” in the diagnostic phase to begin the investigation.
In some of those incidents certain tools proved very useful, including EnCase Enterprise, Guidance Cyber-Security, McAfee Tanium, McAfee EPO, Altiris, Microsoft Active Directory - and more fundamentally the tools that already exist on every computer and operating system for interrogating file and operating or communications sub-systems. This allowed us to cover the IT environment, and move quickly to an assessment of those systems of interest with forensic methods and tools. We utilized the information provided to initiate the workflow, and as new information was discovered – including the anomalous use of malware, system tools, credentials, and systems configurations indicating behavioral abnormalities or inconsistencies – it was fed back into the workflow to increase our understanding of the scope of the incident and reduce the eventual number of systems involved.
In each of these incidents the organizations involved had between 50,000 and 150,000 systems (though we have also assisted organizations with more than 300,000 systems). Although there was general malware found throughout the IT environments, malware specifically related to the incident notifications was discovered on fewer than 50 computers in any of the individual organizations. However, the scope of compromise in each included hundreds of hosts that had been accessed without supporting malware and with tools and credentials otherwise available for use by the organization itself. By focusing on the behaviors rather than simply the IOCs we have been able to identify methods used for each stage of activity – compromise, reconnaissance, and data theft. In doing so we have been able to recommend actions to help prevent subsequent stages. We have also benefited by working with existing technologies and supporting vendors of traditional malware discovery (like A/V and Antimalware software and network devices) and SIEM and similar solutions; our methods are complementary.
It is tempting to use the limited information available and pursue full disk forensics on each system identified, but it is both unnecessary and impractical to start with that. Selective full disk forensics can help, but it should be limited by awareness of the related risks to the organization. For example, sometimes the waiting period for authorization to get access to the system of interest for forensic imaging – due to regional privacy regulations and associated technical constraints – can impede the response. A more nimble and specific information collection that focuses on artifacts that do not include NPI helps an organization respond quickly. The speed of response when applied with intelligent methods to the scale of the organization can provide a more effective response.
An efficient approach to CSIRT contingency management means collecting the least information that has the most value, in other words “least cost”. A full disk image is not always necessary; it can actually contain too much information – information that may cause delays in collection due to either legal or technical impediments.