Building Your Threat Test Set

Even mature organizations find it difficult to test their threat detection capabilities. Part of the reason is that every organization is unique. Many, though few would admit it, rely on test sample sets that offer false confidence that their unique environments are truly being tested. For example, an architectural firm might use specialized file formats for its drawings and plans, whereas a healthcare environment uses specialized image formats for x-rays and MRIs.

So, testing those environments with a generalized test sample set focused on a broad range of Office-related documents would likely give those security teams false confidence and, ultimately, increase their risk. This is why it is recommended that security teams tailor their samples to their unique environment alongside any generalized test set. This means a better balance of PE32 files, Office documents, and the unique file types that match their current environment.
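As an illustration, a tailored set can be composed by drawing from per-file-type pools until a target distribution is met. The sketch below is a minimal example; the directory layout, file-type labels, and target percentages are assumptions you would replace with your own environment's profile.

```python
import random
from pathlib import Path

# Hypothetical target mix reflecting one environment's file-type profile;
# replace these labels and weights with your own breakdown.
TARGET_MIX = {
    "pe32": 0.40,      # Windows executables
    "office": 0.25,    # Word/Excel/PowerPoint documents
    "pdf": 0.15,
    "dwg": 0.10,       # e.g. an architectural firm's drawing files
    "script": 0.10,    # JavaScript, PowerShell, etc.
}

def build_test_set(pool_root: Path, total: int = 2000) -> list[Path]:
    """Draw samples from per-type subdirectories (pool_root/<type>/)
    so the final set matches the target file-type distribution."""
    test_set = []
    for file_type, fraction in TARGET_MIX.items():
        pool = [p for p in (pool_root / file_type).iterdir() if p.is_file()]
        count = round(total * fraction)
        if len(pool) < count:
            raise ValueError(f"Not enough {file_type} samples: "
                             f"need {count}, have {len(pool)}")
        test_set.extend(random.sample(pool, count))
    random.shuffle(test_set)
    return test_set
```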

How many samples do I need for each file type?

Understand that a test with too small a sample size has a good chance of being inaccurate or inconclusive. A set of 2,000 samples at a 95% confidence level has a margin of error of roughly 2%, so a sample size on the order of 2,000 is recommended to have confidence in your results. You can read more about sample size and margin of error at https://www.isixsigma.com/tools-templates/sampling-data/margin-error-and-confidence-levels-made-simple/.
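As a quick sanity check, the standard margin-of-error formula for a proportion, MOE = z * sqrt(p(1-p)/n), reproduces the numbers above. The short sketch below uses the worst-case proportion p = 0.5 and the 95% z-score of 1.96; at n = 2,000 it yields a margin of error of about 2.2%.

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Margin of error for a proportion p at sample size n.
    z = 1.96 corresponds to a 95% confidence level; p = 0.5 is the
    worst case (largest margin)."""
    return z * math.sqrt(p * (1 - p) / n)

def required_sample_size(moe: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest n whose margin of error does not exceed moe."""
    return math.ceil(z * z * p * (1 - p) / (moe * moe))

print(margin_of_error(2000))        # ~0.0219, i.e. about 2.2%
print(required_sample_size(0.02))   # 2401 samples for a strict 2% margin
```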

What’s next? Well, here is the crux of the problem: how do you get samples of unknown malware? Any malware you get from a repository or malware feed has already been seen by every vendor that subscribes to those services. It is all known malware.

There are two recognized methods for curating "unknown" or zero-day malware: time-lag analysis and creating unique generated samples. Time-lag analysis involves disconnecting your solution under test (SUT) and allowing no internet updates for a period of time. During that same period, you collect the samples and file types to be tested. With the SUT still disconnected from the internet so that it cannot update, you then run these collected samples against it. Essentially, you are testing the device with malware that is known to the industry but still unknown to the solution, because the solution has been quarantined from updates. Remember not to let the solution update while you run the test either; some solutions may not allow you to block updates, which should raise questions of its own.
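A minimal sketch of the time-lag bookkeeping might look like the following. The cut-off timestamp and the record fields ("first_seen", "path") are hypothetical placeholders for whatever your collection process actually produces.

```python
from datetime import datetime, timezone

# Hypothetical: the moment the SUT was quarantined from updates.
SUT_LAST_UPDATE = datetime(2024, 1, 1, tzinfo=timezone.utc)

def select_time_lag_samples(collected: list[dict]) -> list[dict]:
    """Keep only samples first seen AFTER the SUT's last signature or
    model update, so every sample in the test set is 'unknown' to the
    quarantined solution. Field names here are illustrative."""
    return [s for s in collected if s["first_seen"] > SUT_LAST_UPDATE]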

Creating unique samples, while the most rigorous approach, is very difficult to manage, so the time-lag method is recommended. For more information on malware creation and zero-day testing, we recommend reading the material from the Anti-Malware Testing Standards Organization (AMTSO) at https://www.amtso.org/documents/.

I’m sure you have gathered by now that this isn’t a walk in the park, but it is achievable. Ask your vendors questions: understanding how they test and validate their own solutions is important in your quest to deploy the best solutions for your organization. So again, for known-threat testing, define a file type breakdown for the test and filter for samples you consider malicious, based either on industry consensus or on other validation methods. Gather lots of samples, 2,000 or more. For false positive testing, create a set that includes both malicious and benign files, which could include PUPs or PUAs. And unless you are an expert in the field of malware creation, use the time-lag analysis method for developing your unknown or zero-day malware.


Nick Arraje has over 20 years of technical sales experience working with companies in telecom, network visibility and cybersecurity. While working in the telecom market, he worked with Tier 1 and network equipment manufacturers on network simulation tools and testing scenarios to validate product features and capabilities. Nick has a degree in Electrical Engineering from Northeastern University.


The Difficulties of Testing Threat Detection

In some respects, we are all new to cybersecurity. As threat vectors change and malware evolves or appears on a daily basis, security vendors continue to provide new and unique detection solutions to find these threats using some of the most cutting-edge technology, like artificial intelligence, machine learning, behavioral analytics and more.

Yet, do they work? All of this sounds great if you are willing to accept the marketing hype, but hype often differs greatly from reality. Many vendors publish detection accuracy and false positive or negative rates, but how do you, as a customer, go about testing or validating these claims?

To start, you will need a large number of samples (thousands or more) of known and unknown malware. Additionally, you will want a good distribution of malware types, both file-based and fileless threats, including Windows executables, DMG, Office, Adobe, Rich Text Format, image formats, Android APK and JavaScript. A sketch of one way to verify that coverage follows below.
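To check that a gathered set actually covers those types, each sample's true format can be identified from its content rather than its extension. The sketch below uses the third-party python-magic library (a libmagic wrapper); the keyword-to-bucket mapping is a rough assumption for illustration, not an authoritative classification.

```python
from collections import Counter
from pathlib import Path

import magic  # third-party python-magic, a libmagic wrapper

# Rough keyword-based buckets; real classification rules would be
# more careful than this illustrative mapping.
BUCKETS = {
    "PE32": "windows_executable",
    "Mach-O": "macos_binary",
    "PDF": "pdf",
    "Rich Text": "rtf",
    "Composite Document": "office_legacy",
    "Zip archive": "zip_based",   # docx/xlsx/APK all arrive as ZIPs
    "image": "image",
}

def profile_samples(sample_dir: Path) -> Counter:
    """Count samples per file-type bucket using content sniffing."""
    counts = Counter()
    for path in sample_dir.iterdir():
        if not path.is_file():
            continue
        description = magic.from_file(str(path))
        bucket = next((b for k, b in BUCKETS.items() if k in description),
                      "other")
        counts[bucket] += 1
    return counts
```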

Where can you get these samples? There are several online malware repositories and feeds available, including VirusTotal, for known malware. One note here: malware creators, and organizations with threat intelligence teams, will likely have these kinds of "known and unknown" threats in their libraries.

Yet, once tested, “unknown” threats become known. So, a proposed solution is to use an existing malware repository to create a test set that best exemplifies the threat profile of the organization and then control the testing environment to create an “unknown” test set of malware. Unknown malware is a little more complex, and I will discuss it in more detail in my next blog; stay tuned!

To understand the complexity of the problem even further, there is disagreement in the security community on what is and isn’t malware. In a test using VirusTotal to download a batch of 1,000 samples, it is likely that 30% or more of the samples will not be considered malware by the majority of VirusTotal’s 70+ antivirus scanners; for these files, the engines split their verdicts, with some identifying the suspicious file as malware and others not. This is because these samples include what are known as "Potentially Unwanted Programs and Applications" (PUPs and PUAs), and whether these types of suspicious files count as malware continues to be debated in the community. If PUPs and PUAs are an area you want (or are required) to test, raise this with your vendor, as many do not consider these to be malware, which will affect your false positive rate.

One recommendation for collecting known malware is to use vendor consensus to confirm the samples being collected. Secondly, if you download a sample set from VirusTotal, the majority of the samples will likely be PE32 GUIs; in my example of 1,000 downloaded samples, 90% were PE32 GUIs. Most feeds and repositories contain a range of file types, including Windows executables, Android APKs, PDFs, images, JavaScript and others. So build a set of samples for each file type.
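As a sketch of the consensus check mentioned above, the VirusTotal v3 API returns per-file last_analysis_stats with counts of malicious and undetected verdicts, which can be turned into a simple threshold filter. The 70% agreement bar below is an arbitrary assumption; set whatever consensus threshold fits your test.

```python
import requests

VT_URL = "https://www.virustotal.com/api/v3/files/{}"

def has_vendor_consensus(sha256: str, api_key: str,
                         threshold: float = 0.7) -> bool:
    """Return True if at least `threshold` of the engines that rendered
    a verdict flagged the file as malicious. The 0.7 bar is arbitrary."""
    resp = requests.get(VT_URL.format(sha256),
                        headers={"x-apikey": api_key})
    resp.raise_for_status()
    stats = resp.json()["data"]["attributes"]["last_analysis_stats"]
    flagged = stats["malicious"]
    total = flagged + stats["suspicious"] + stats["undetected"]
    return total > 0 and flagged / total >= threshold
```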

Feel free to reach out and share your ideas on how to go about creating a test scenario for malware detection and curating known and unknown malware. In my next blog, I will talk about why sample size is important and the currently recognized methods for curating "unknown" samples.


Nick Arraje has over 20 years of technical sales experience working with companies in telecom, network visibility and cybersecurity. While working in the telecom market, he worked with Tier 1 and network equipment manufacturers on network simulation tools and testing scenarios to validate product features and capabilities. Nick has a degree in Electrical Engineering from Northeastern University.