cigma: A Pure Python Filetype Identification Library


« 📅 published on 22/Nov/2015 »

🔖 tagged code


Introduction

Filetype identification is the process of scanning respective magicbytes for each filetype within input streams. The mapping for various filetypes and their magicbytes are stored in different formats which is then queried by the identification library. If a match is found, input stream is classified to be of that type. In case of match collisions, a signature preference logic prioritizes matches. Tools like TrID and others use similar logic.

cigma is a Python library to identify filetypes. It provides libmagic like mimetype identification of a file or data buffer. This is similar to what the file command on *nix systems will provide:

$ file cigma.py
cigma.py: Python script, ASCII text executable
$
$ file readme.md
readme.md: Python script, ASCII text executable, with very long lines
$
$ file /bin/ls
/bin/ls: ELF 64-bit LSB  executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.24, BuildID[sha1]=9d2a434c4ff55aad2ddd19348c0ac75971606483, stripped
$
$ file /etc/passwd
/etc/passwd: ASCII text
$
$ file /dev/sda
/dev/sda: block special

Usage and Usecases

cigma uses a custom JSON formatted signature file to stores filetype mappings. Let's try identifying a few files:

$ python cigma.py /bin/ls
('/bin/ls',
 {'id': 29,
  'longname': 'Executable and Linkable Format (ELF)',
  'mimetype': 'application/x-executable',
  'patterns': [{'offset': 0, 'regex': '\\x7F\\x45\\x4C\\x46', 'size': 4}],
  'shortname': 'ELF'})
$
$ python cigma.py ~/toolbox/testfiles/binary/suspicious/35d249cdd501aeb5a5b39daeb4f275c41c73e91ef299a094d27edbfd0396715d.VXE
('/home/shiv/toolbox/testfiles/binary/suspicious/35d249cdd501aeb5a5b39daeb4f275c41c73e91ef299a094d27edbfd0396715d.VXE',
 {'id': 31,
  'longname': 'Windows Executable',
  'mimetype': 'application/x-dosexec',
  'patterns': [{'offset': 0, 'regex': '\\x4D\\x5A', 'size': 2}],
  'shortname': 'EXE'})
$
$ python cigma.py ~/toolbox/testfiles/pcaps/exploitkits/2015-04-03-Nuclear-EK-traffic.pcap
('/home/shiv/toolbox/testfiles/pcaps/exploitkits/2015-04-03-Nuclear-EK-traffic.pcap',
 {'id': 5,
  'longname': 'Packet Capture (winpcap)',
  'mimetype': 'application/vnd.tcpdump.pcap',
  'patterns': [{'offset': 0, 'regex': '\\xD4\\xC3\\xB2\\xA1', 'size': 4}],
  'shortname': 'PCAP'})

If a match is found, cigma returns a (source, resultdict) tuple. Here is a snippet of few signatures from the current set:

{
  "meta": {
    "rulescount": 72,
    "version": 0.1
  },
  "rules": [
    {
      "id": 1,
      "longname": "PKZIP",
      "mimetype": "application/octet-stream",
      "patterns": [
        {
          "offset": 0,
          "regex": "\\x50\\x4B\\x05\\x06",
          "size": 4
        }
      ],
      "shortname": "ZIP"
    },
    {
      "id": 5,
      "longname": "Packet Capture (winpcap)",
      "mimetype": "application/vnd.tcpdump.pcap",
      "patterns": [
        {
          "offset": 0,
          "regex": "\\xD4\\xC3\\xB2\\xA1",
          "size": 4
        }
      ],
      "shortname": "PCAP"
    },
    {
      "id": 27,
      "longname": "Flash Video",
      "mimetype": "video/x-flv",
      "patterns": [
        {
          "offset": 0,
          "regex": "\\x46\\x4C\\x56",
          "size": 3
        }
      ],
      "shortname": "FLV"
    },
    {
      "id": 29,
      "longname": "Executable and Linkable Format (ELF)",
      "mimetype": "application/x-executable",
      "patterns": [
        {
          "offset": 0,
          "regex": "\\x7F\\x45\\x4C\\x46",
          "size": 4
        }
      ],
      "shortname": "ELF"
    }
  ]
}

Adding a new signature is really easy. Just edit the included magicbytes.json file and add a new node with an unique id. Each node should have a patterns list which should contain the offset, regex and size keys with respective values.

Using cigma as a library is really easy:

from cigma import Cigma

One this is done, you can identify a file buffer:

with open("/home/shiv/toolbox/testfiles/binary/crafted_corkami/gui.exe") as fo:
  filedata = fo.read()

Cigma(data=filedata).cigma()
('databuffer',
 {'longname': 'Windows Executable',
  'mimetype': 'application/x-dosexec',
  'patterns': [{'offset': 0, 'regex': '\\x4D\\x5A', 'size': 2}],
  'shortname': 'EXE'})

Or else you can also identify a file itself:

Cigma(filename="/home/shiv/toolbox/testfiles/binary/crafted_corkami/gui.exe").cigma()
('/home/shiv/toolbox/testfiles/binary/crafted_corkami/gui.exe',
 {'longname': 'Windows Executable',
  'mimetype': 'application/x-dosexec',
  'patterns': [{'offset': 0, 'regex': '\\x4D\\x5A', 'size': 2}],
  'shortname': 'EXE'})

Conclusion

The primary purpose of cigma is to be able to identify mundane files at lightning speeds. It doesn't aim to provide exhaustive insight into each file by parsing and decoding it. Currently, there are 70+ signatures and identification for more filetypes will be available as the project grows.


« capinfos.py: Pure Python Pcap ... «

» FireEye FLARE On 2014 Challeng... »

  