VirusTotal stores a vast collection of files, URLs, domains, and IPs submitted by users worldwide. It features a variety of functionalities and integrates third-party detection engines and tools to analyze the maliciousness of submitted artifacts and gather relevant related information, such as file properties, domain registrars, and execution behaviors. The VirusTotal dataset, the backbone of the platform, structures artifact-related information into objects and represents relevant relationships between them, providing contextual links between various artifacts. This makes VirusTotal a valuable resource for threat research, enabling users to perform activities such as clustering artifacts related to specific threat actors or campaigns, tracking malicious activities, and analyzing trends in the threat landscape. In this post, part of a collaborative effort between VirusTotal and SentinelLabs, we explore how to effectively use VirusTotal’s wide range of querying capabilities, highlight scenarios in which these capabilities return informative results, and discuss factors that may impact the completeness or relevance of the data. The content is aimed at VirusTotal users seeking to better understand the fundamental inner workings of the platform and how to effectively use it as part of their investigations. This contribution complements the comprehensive VirusTotal documentation by discussing certain aspects in greater detail along with a summary of relevant context and usage information, and demonstrating how VirusTotal capabilities are applied in real-world cases. Overview The VirusTotal platform analyzes files and network-related artifacts (URLs, domains, and IPs) submitted to the platform to detect maliciousness. The platform aggregates results from third-party detection engines, web scanners, and other tools to provide thorough analysis overviews. VirusTotal stores submitted artifacts as well as information related to each artifact in a dataset, which we refer to as the VirusTotal dataset. The artifact-related information is extensive and diverse, including, for example, file properties such as filename, file type, digital signatures, and hashes, as well as URL components such as domains, URL paths, and URL query parameters. VirusTotal provides interfaces for users to interact with the platform and search filters for querying it. The search filters allow for retrieving and pivoting through artifact-related information. Additionally, they enable clustering multiple artifacts and identifying newly submitted ones based on user-defined queries aimed at capturing overlaps in content or related information. This has many use cases in threat intelligence, such as identifying trends in the threat landscape, commonalities between different threats, and tracking specific threat groups or campaigns. Below, we first provide an overview of the VirusTotal dataset structure and the different interfaces available for interacting with it, with a focus on querying the dataset using search filters. We then delve into search modifiers, a specific type of VirusTotal search filter, highlighting modifiers for querying data generated by artificial intelligence (AI) technology. Next, we discuss factors that may impact the relevance or completeness of results when querying using search modifiers, including an example of using search modifiers in an actual threat research investigation. The VirusTotal Dataset The VirusTotal dataset is the backbone of the VirusTotal platform. Current records indicate that it stores a vast amount of submitted artifacts and related information, including over 50 billion files, 6 billion URLs, and 4 billion domains. The data is stored in a structured and hierarchical manner. The top-level structure of the VirusTotal dataset Artifact-related information is structured into objects, which have an ID, a type, and attributes. Optionally, objects may also have one or multiple relationships. VirusTotal objects The object ID uniquely identifies an object. An object can be directly related to a submitted artifact: a file, URL, domain, or IP address. In this case, the object's ID is derived from the artifact itself — the SHA-256 hash for files, the IP address or domain itself for IPs or domains, and the SHA-256 hash or the Base-64 encoded form for URLs. An object type indicates the type of information stored by an object. For example, an object of type file stores file-specific information about submitted files, such as filenames, file hashes, and the file extension, whereas an object of type domain stores domain-specific information about submitted domains, such as the domain’s registrar. The objects of type file, url, ip, or domain are directly related to submitted artifacts, while the rest, such as threat_actor or reference, are not. An attribute is a data item that stores information related to an object and can be of a primitive or a complex data type, such as an array or a structure. For example, the file object (an object of type file) includes the string attribute sha256. This attribute stores the SHA-256 hash of a submitted file. Additionally, it may contain the attribute lnk_info, specifically present for Windows Shortcut (LNK) files. lnk_info is a structure containing information specific to LNK files and extracted from the submitted file, such as the date at which the shortcut had been created. Relationships signify connections between objects of the same or different types, making them particularly useful for describing scenarios involving multiple artifacts. For instance, the malicious file config.lnk file (SHA-256 hash 85b317bb4463a93ecc4d25af872401984d61e9ddcee4c275ea1f1d9875b5fa61) communicates with the IP address 149.51.230[.]198 to download a payload. In the context of VirusTotal, this relationship is represented through the communicating_files relationship of the ip_address object, which is directly related to 149.51.230[.]198. communicating_files stores information about all files, in the form of file objects, observed to communicate with the IP address during sandbox execution. VirusTotal executes submitted executable files in sandboxes to capture behaviors and artifacts visible only while the file is executing, such as started processes, network communications, changes to the file system, or strings present in process memory. The communicating_files relationship Top-level collections group all objects of the same type. The top-level collections that VirusTotal currently implements are files (a set of all objects of type file), urls (a set of all objects of type url), ips (a set of all objects of type ip), domains (a set of all objects of type domain), collections (a set of all objects of type collection), threat actors (a set of all objects of type threat_actor), and references (a set of all objects of type reference). The object of type collection is not to be confused with top-level collections. This object groups multiple objects of the same or different types given a user-specified context, such as a threat actor, a malicious campaign, or a malware family. The top-level collections enable operations that relate to the set of all objects of a given type. Such an operation is submitting a new file for analysis, which adds an object of type file to the files collection. Querying VirusTotal VirusTotal exposes two interfaces for interacting with its dataset: the platform’s graphical user interface (GUI) for manual interaction and the application program interface (API) for programmatic interaction. The VirusTotal GUI The VirusTotal GUI is the web interface of the platform. To query VirusTotal using the GUI, users enter a search query into the search field. A search query is composed of one or multiple search filters. A search filter can be a value uniquely identifying a submitted artifact – a URL, domain, file hash (MD5, SHA-1, or SHA-256), or an IP address. This filter cannot be combined with other filters and is used for retrieving information related to only a single submitted artifact. When a user inputs a value, VirusTotal retrieves the corresponding artifact and related information from its dataset and displays this data to the user as a web analysis report. To retrieve the artifact and its related information, VirusTotal searches its dataset for an object that has an ID or an attribute (for MD5 or SHA-1 hashes) that matches the user-provided value. The platform also explores relationships to and from this object. Web analysis report (search query: 85b317bb4463a93ecc4d25af872401984d61e9ddcee4c275ea1f1d9875b5fa61) Querying using value uniquely identifying a submitted artifact A search filter can also be a search modifier in the format modifier:value, where value may be a predefined or a user-specified search criterion. Each modifier is mapped to one or more of the top-level collections files, urls, ips, domains, and collections, forming sets of file-, URL-, IP-, domain-, and collection-specific search modifiers. Further information on these modifiers can be found in the official VirusTotal documentation. Multiple modifiers can be combined into more complex search queries using the logical operators AND, OR, and NOT. Parentheses can be used to group modifiers and logical operators, allowing for more precise queries by controlling the order of operations. All modifiers within a search query must be mapped to a single top-level collection. A special modifier is entity, which defines the top-level collection to which the search query is applied. For example, entity:file applies the search query to the files collection, and entity:url applies the query to the urls collection. If a user does not explicitly specify the entity modifier, the platform defaults to entity:file. Structured overview of commonly used file-specific search modifiers (entity:file) For each modifier, VirusTotal searches for and retrieves objects within the collection that the modifier is mapped to. These objects have an ID or attribute, or a relationship to another object with an ID or attribute, that meets the search criterion. For example, entity:domain AND domain:test instructs VirusTotal to retrieve all objects of type domain whose IDs (the domains themselves) contain the string test. An exception is the content modifier, which instructs the platform to search through the content of submitted files. In the case of a complex search query combining multiple modifiers using AND, OR, and/or NOT, VirusTotal combines the retrieved objects for each modifier into a resulting set that meets the combined criteria. The platform then displays an overview of the resulting set of objects to the user in the form of a list of web analysis results. When a user clicks on an item in the list, the platform generates a web analysis report based on the corresponding object’s ID, as described previously. List of web analysis results (search query: entity:domain AND domain:test) Querying using search modifiers