
Hilbert curves: Visualizing binary files with color and patterns

Analyzing binary files often involves digging through raw data, looking for hidden patterns or anomalies. This can be a challenging and time-consuming process. Hilbert curves, however, offer a powerful visualization technique, making it easier to detect unusual structures or data in files. By mapping file contents to a two-dimensional grid and applying a thoughtful color-coding scheme, we can gain visual insights that traditional analysis might miss. Do you remember in The Matrix when Neo finally sees through the code? Learning to read Hilbert curves made me feel a bit like that.

What is a Hilbert curve?

A Hilbert curve is a type of fractal space-filling curve that converts a one-dimensional sequence (such as the bytes of a file) into a two-dimensional representation. It preserves data locality, meaning that bytes close together in the linear sequence are also close together on the grid. This helps maintain the structure and relationships within the data when visualized, allowing patterns and anomalies to become more apparent.
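To make the mapping concrete, here is a minimal Python sketch of the standard iterative conversion from a position along the curve to grid coordinates. The names `d2xy` and `order` are illustrative choices for this post, not part of any Stairwell tooling, and the orientation of the curve (which corner it starts in) is simply whatever this formulation produces.

```python
def d2xy(order, d):
    """Map distance d along a Hilbert curve to (x, y).

    The grid is 2**order cells on each side, so d must be in
    the range 0 .. 4**order - 1.
    """
    side = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < side:
        # Pick out which quadrant this step of the curve falls in.
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the sub-square so the pieces join up.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


if __name__ == "__main__":
    # Walk the 16 cells of an order-2 (4x4) curve.
    print([d2xy(2, d) for d in range(16)])
```

Each step along the curve lands on a cell that touches the previous one, which is why neighboring bytes in the file end up as neighboring pixels in the image.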

Visualizing binary files with Hilbert curves

When we apply a Hilbert curve to a binary file, we map the file’s bytes onto a 2D grid. Each byte, or small group of bytes, is represented by a point on the curve. This allows the entire file to be visualized as an image, where the spatial arrangement reflects the file’s original data structure. With this visual representation, patterns emerge, and different areas of the file can be analyzed at a glance.

Stairwell’s color coding scheme for Hilbert curve visualization

To make these visualizations meaningful, we use a color-coding scheme that highlights different byte ranges, making it easier to interpret the file’s contents. Within each range, darker shades represent lower values and lighter shades represent higher values, which helps distinguish the density and transitions of data.

Here’s how the color scheme works:

  • Null bytes (0x00) are black: Black is used to represent zero-value bytes, often corresponding to empty regions, padding, or uninitialized data in the file.
  • Low bytes (0x01-0x1F) are green: This range includes control characters, which are non-printable bytes often found in communication protocols or structured data.
  • ASCII printable characters (0x20-0x7E) are blue: These values correspond to text characters that are human-readable, such as strings and metadata.
  • Higher bytes (0x7F-0xFF) are red: High-value bytes often appear in sections containing binary data, such as executable code, compressed data, or multimedia content.

By using this gradient approach within each color range, we add more detail to the visualization, making it easier to identify subtle patterns and transitions within the data.
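The scheme above can be sketched as a small Python function. The four ranges and their colors come straight from the list; the exact shades are an assumption made for illustration (a single color channel ramped from a hypothetical `MIN_LEVEL` up to 255), and the real palette may differ.

```python
MIN_LEVEL = 64  # assumed darkest shade within a range, chosen for illustration


def byte_to_rgb(value):
    """Return an (r, g, b) tuple for a byte value in 0..255."""
    if value == 0x00:
        return (0, 0, 0)               # null bytes: black
    if value <= 0x1F:
        lo, hi, channel = 0x01, 0x1F, 1  # low bytes: green
    elif value <= 0x7E:
        lo, hi, channel = 0x20, 0x7E, 2  # printable ASCII: blue
    else:
        lo, hi, channel = 0x7F, 0xFF, 0  # higher bytes: red
    # Darker for lower values in the range, lighter for higher values.
    fraction = (value - lo) / (hi - lo)
    level = int(MIN_LEVEL + fraction * (255 - MIN_LEVEL))
    rgb = [0, 0, 0]
    rgb[channel] = level
    return tuple(rgb)


if __name__ == "__main__":
    for sample in (0x00, 0x0A, ord("A"), 0xFF):
        print(hex(sample), byte_to_rgb(sample))
```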

Practice examples

Text files, source, and scripts

Text regions and files will be represented by large swaths of blue. The familiar EICAR test file is a great example of this.

 

You may notice it’s very “chunky” and it’s all blue. Because the rendered image is always the same overall size, a curve that represents more data is drawn at a finer resolution, while a tiny file like this one fills the image with large blocks. If we bump it up and take a look at a Python script, we’ll also notice that the newline character (0x0A), which falls in the low-byte range, brings in a new color: green.
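The "chunkiness" can be put into numbers with a rough sketch. It assumes one byte per cell, the smallest curve that fits the file, and a hypothetical 512-pixel square image; the actual renderer may group bytes or size its images differently.

```python
def curve_order(num_bytes):
    """Smallest order whose 2**order x 2**order grid holds num_bytes cells."""
    order = 0
    while 4 ** order < num_bytes:
        order += 1
    return order


def cell_pixels(num_bytes, image_px=512):
    """Pixels per cell edge when the curve is stretched to image_px.

    image_px is an assumed image size; very large files give 0 here,
    meaning several bytes would have to share one pixel.
    """
    side = 2 ** curve_order(num_bytes)
    return image_px // side


if __name__ == "__main__":
    for size in (68, 5_000, 200_000):
        print(size, curve_order(size), cell_pixels(size))
```

Under these assumptions a 68-byte file lands on a 16x16 grid with 32-pixel blocks, while a 5,000-byte script needs a 128x128 grid with 4-pixel blocks, so the same image area carries far more detail.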

Binary formats provide structure

Once we start to consider executables, multiple factors impose their structure on our files. ELFs, Mach-Os, and PEs will all have their own unique characteristics. Let’s say Hello to the World:

We can see blocks start to emerge. Header sections here, code and string sections there. The programming languages and compilers chosen will also impose some structure.

How Hilbert curves can be used for file analysis

The combination of Hilbert curves and the color-coding scheme allows for effective visual analysis of binary file patterns. Here’s how this technique can be applied in practice:

1. Detecting compressed or encrypted data

Compressed or encrypted data often exhibits high entropy, resulting in chaotic, unpredictable patterns on the Hilbert curve. Because roughly half of all possible byte values fall in the higher range (0x7F-0xFF), these areas typically show up as dense, noisy patches dominated by red, speckled with blue and green and with little uniformity. In contrast, sections with low entropy (more structured data) will exhibit more predictable color patterns. This contrast helps differentiate between structured data, such as configuration files, and high-entropy data, such as compressed archives or encrypted payloads.

Here, we have an Emotet lure document that has the standard doc formatting but also large sections of obfuscated data for the payload.
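The entropy idea can also be checked numerically alongside the picture. The sketch below computes Shannon entropy in bits per byte over fixed-size windows; the function names and the 256-byte window are assumptions made for illustration and are not part of the visualization itself.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy of a bytes object, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def block_entropies(data, window=256):
    """Entropy of each consecutive window-sized block of data."""
    return [
        shannon_entropy(data[i:i + window])
        for i in range(0, len(data), window)
    ]


if __name__ == "__main__":
    padding = bytes(512)               # all null bytes: lowest entropy
    varied = bytes(range(256))         # every byte value once: highest entropy
    print(block_entropies(padding + varied))
```

Blocks that score close to 8 bits per byte are the ones that would appear as noisy patches on the curve, while padding and plain text score much lower.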

2. Assessing rule specificity

Hilbert curves can help in evaluating whether a YARA rule is too specific or too generic by showing the regions a rule matches within different files.

  • Too generic: If a rule matches many disparate file types and areas, the rule may be too broad, increasing the likelihood of false positives. In this case, you’d want to tighten the rule by focusing on more specific byte patterns or strings that are unique to the malware you’re targeting.

Here, we see the results of a YARA rule that matches files of many different sizes and file types.

3. Tracking file modifications over time

When comparing different versions of a file, changes in the byte-level structure can be easily identified through visual inspection. With Hilbert curve visualizations, differences such as added code, inserted data, or altered text will manifest as changes in color patterns, providing a straightforward way to detect modifications or unexpected changes.

Some of these files are calc.exe, but one of them has been altered. Hilbert anomaly analysis makes it trivial to see which one stands out.
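For a programmatic counterpart to the visual comparison, the following sketch lists the offset ranges where two versions of a file differ. It is a deliberately simple positional comparison with a hypothetical function name: inserted bytes shift everything after them and will be reported as one long changed run, which is also roughly how such an insertion would look when the two curves are compared side by side.

```python
def changed_ranges(old, new):
    """Return half-open (start, end) offset ranges where two byte strings differ.

    Bytes are compared position by position; any extra tail in the
    longer input counts as changed.
    """
    ranges = []
    start = None
    common = min(len(old), len(new))
    for i in range(common):
        if old[i] != new[i]:
            if start is None:
                start = i          # a new run of differences begins
        elif start is not None:
            ranges.append((start, i))
            start = None
    longest = max(len(old), len(new))
    if longest > common:
        # One input is longer: the tail extends (or starts) a changed run.
        if start is None:
            start = common
        ranges.append((start, longest))
    elif start is not None:
        ranges.append((start, common))
    return ranges


if __name__ == "__main__":
    print(changed_ranges(b"abcdef", b"abXdeY"))
```

The resulting offsets could then be passed through the same curve mapping to highlight exactly which cells changed between versions.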

Conclusion

The use of Hilbert curves, combined with a well-thought-out color-coding scheme, provides a new dimension to binary file analysis. It allows cybersecurity experts to visualize and identify patterns, anomalies, and structural characteristics of files at a glance. This method transforms the traditional process of manual byte analysis into a visual and intuitive approach, revealing insights that would be difficult to spot otherwise.

By mapping file contents to a 2D grid and applying color gradients to represent different byte values, Stairwell empowers security teams to detect and respond to threats more effectively. Whether it’s identifying malware, analyzing executable structures, or tracking file changes, Hilbert curve visualizations make the invisible visible, bringing the full picture into focus.
