eulcore.binfile – Map binary data to Python objects

Map binary data on-disk to read-only Python objects.

This module facilitates exposing stored binary data using common Pythonic idioms. Fields in relocatable binary objects map to Python attributes using a priori knowledge about how the binary structure is organized. This is akin to the standard struct module, but with some slightly different use cases. struct, for instance, offers a more terse syntax, which is handy for certain simple structures. struct is also a bit faster since it’s implemented in C. This module’s more verbose BinaryStructure definitions give it a few advantages over struct, though:

  • This module allows users to define their own field types, where struct field types are basically inextensible.
  • The object-based nature of BinaryStructure makes it easy to add non-structural properties and methods to subclasses, which would require a bit of reimplementing and wrapping from a struct tuple.
  • BinaryStructure instances access fields through named properties instead of indexed tuples. struct tuples are fine for structures a few fields long, but when a packed binary structure grows to dozens of fields, navigating its struct tuple grows perilous.
  • BinaryStructure unpacks fields only when they’re accessed, allowing us to define libraries of structures scores of fields long, understanding that any particular application might access only one or two of them.
  • Fields in a BinaryStructure can overlap eachother, greatly simplifying both C unions and fields with multiple interpretations (integer/string, signed/unsigned).
  • This module makes sparse structures easy. If you’re reverse-engineering a large binary structure and discover a 4-byte integer in the middle of 68 bytes of unidentified mess, this module makes it easy to add an IntegerField at a known structure offset. struct requires you to split your '68x' into a '32xI32x' (or was that a '30xi34x'? Better recount.)
This package exports the following names:
  • BinaryStructure – a base class for binary data structures
  • ByteField – a field that maps fixed-length binary data to Python strings
  • LengthPrependedStringField – a field that maps variable-length binary strings to Python strings
  • IntegerField – a field that maps fixed-length binary data to Python numbers

General Usage

Suppose we have an 8-byte file whose binary data consists of the bytes 0, 1, 2, 3, etc.:

>>> with open('numbers.bin') as f:
...     f.read()
...
'\x00\x01\x02\x03\x04\x05\x06\x07'

Suppose further that these contents represent sensible binary data, laid out such that the first two bytes are a literal string value. Except that sometimes, in the binary format we’re parsing, it might sometimes be necessary to interpret those first two bytes not as a literal string, but instead as a number, encoded as a big-endian unsigned integer. Following that is a variable-length string, encoded with the total string length in the third byte.

This structure might be represented as:

from eulcommon.binfile import *
class MyObject(BinaryStructure):
    mybytes = ByteField(0, 2)
    myint = IntegerField(0, 2)
    mystring = LengthPrepededStringField(2)

Client code might then read data from that file:

>>> f = open('numbers.bin')
>>> obj = MyObject(f)
>>> obj.mybytes
'\x00\x01'
>>> obj.myint
1
>>> obj.mystring
'\x03\x04'

It’s not uncommon for such binary structures to be repeated at different points within a file. Consider if we overlay the same structure on the same file, but starting at byte 1 instead of byte 0:

>>> f = open('numbers.bin')
>>> obj = MyObject(f, offset=1)
>>> obj.mybytes
'\x01\x02'
>>> obj.myint
258
>>> obj.mystring
'\x04\x05\x06'

BinaryStructure

class eulcommon.binfile.BinaryStructure(fobj=None, mm=None, offset=0)

A superclass for binary data structures superimposed over files.

Typical users will create a subclass containing field objects (e.g., ByteField, IntegerField). Each subclass instance is created with a file and with an optional offset into that file. When code accesses fields on the instance, they are calculated from the underlying binary file data.

Instead of a file, it is occasionally appropriate to overlay an mmap structure (from the mmap standard library). This happens most often when one BinaryStructure instance creates another, passing self.mmap to the secondary object’s constructor. In this case, the caller may specify the mm argument instead of an fobj.

Parameters:
  • fobj – a file object or filename to overlay
  • mm – a mmap object to overlay
  • offset – the offset into the file where the structured data begins

Field classes

class eulcommon.binfile.ByteField(start, end)

A field mapping fixed-length binary data to Python strings.

Parameters:
  • start – The offset into the structure of the beginning of the byte data.
  • end – The offset into the structure of the end of the byte data. This is actually one past the last byte of data, so a four-byte ByteField starting at index 4 would be defined as ByteField(4, 8) and would include bytes 4, 5, 6, and 7 of the binary structure.

Typical users will create a ByteField inside a BinaryStructure subclass definition:

class MyObject(BinaryStructure):
    myfield = ByteField(0, 4) # the first 4 bytes of the file

When you instantiate the subclass and access the field, its value will be the literal bytes at that location in the structure:

>>> o = MyObject('file.bin')
>>> o.myfield
'ABCD'
class eulcommon.binfile.LengthPrependedStringField(offset)

A field mapping variable-length binary strings to Python strings.

This field accesses strings encoded with their length in their first byte and string data following that byte.

Parameters:offset – The offset of the single-byte string length.

Typical users will create a LengthPrependedStringField inside a BinaryStructure subclass definition:

class MyObject(BinaryStructure):
    myfield = LengthPrependedStringField(0)

When you instantiate the subclass and access the field, its length will be read from that location in the structure, and its data will be the bytes immediately following it. So with a file whose first bytes are '\x04ABCD':

>>> o = MyObject('file.bin')
>>> o.myfield
'ABCD'
class eulcommon.binfile.IntegerField(start, end)

A field mapping fixed-length binary data to Python numbers.

This field accessses arbitrary-length integers encoded as binary data. Currently only big-endian, unsigned integers are supported.

Parameters:
  • start – The offset into the structure of the beginning of the byte data.
  • end – The offset into the structure of the end of the byte data. This is actually one past the last byte of data, so a four-byte IntegerField starting at index 4 would be defined as IntegerField(4, 8) and would include bytes 4, 5, 6, and 7 of the binary structure.

Typical users will create an IntegerField inside a BinaryStructure subclass definition:

class MyObject(BinaryStructure):
    myfield = IntegerField(3, 6) # integer encoded in bytes 3, 4, 5

When you instantiate the subclass and access the field, its value will be big-endian unsigned integer encoded at that location in the structure. So with a file whose bytes 3, 4, and 5 are '\x00\x01\x04':

>>> o = MyObject('file.bin')
>>> o.myfield
260