Extractor for PCAP Files

pcapkit.foundation.extraction contains only Extractor, which synthesises file I/O and protocol analysis, coordinates information exchange across all network layers, and extracts parameters from a PCAP file.

Todo

Implement engine support for pypcap & pycapfile.

class pcapkit.foundation.extraction.Extractor(fin=None, fout=None, format=None, auto=True, extension=True, store=True, files=False, nofile=False, verbose=False, engine=None, layer=None, protocol=None, ip=False, ipv4=False, ipv6=False, tcp=False, strict=True, trace=False, trace_fout=None, trace_format=None, trace_byteorder='little', trace_nanosecond=False)

Bases: object

Extractor for PCAP files.

Notes

For supported engines, please refer to run().

Parameters
  • fin (Optional[str]) – name of the file to be read; if the file does not exist, FileNotFound is raised

  • fout (Optional[str]) – name of the file to be written

  • format (Optional[Formats]) – output file format

  • auto (bool) – whether to run automatically until EOF

  • extension (bool) – whether to check and append extensions to the output file name

  • store (bool) – whether to store extracted packet information

  • files (bool) – whether to split each frame into a separate file

  • nofile (bool) – whether to suppress dumping of any output file

  • verbose (bool | VerboseHandler) – a bool value, or a function that takes the Extractor instance and the currently parsed frame (whose type depends on the engine selected) as parameters and prints verbose output information

  • engine (Optional[Engines]) – extraction engine to be used

  • layer (Optional[Layers]) – the layer until which to extract

  • protocol (Optional[Protocols]) – the protocol until which to extract

  • ip (bool) – whether to record data for IPv4 & IPv6 reassembly

  • ipv4 (bool) – whether to perform IPv4 reassembly

  • ipv6 (bool) – whether to perform IPv6 reassembly

  • tcp (bool) – whether to perform TCP reassembly

  • strict (bool) – whether to set the strict flag for reassembly

  • trace (bool) – whether to trace TCP traffic flows

  • trace_fout (Optional[str]) – path name for the flow tracer, if necessary

  • trace_format (Optional[Formats]) – output file format of the flow tracer

  • trace_byteorder (Literal["big", "little"]) – byte order of the output file

  • trace_nanosecond (bool) – whether the output file uses nanosecond resolution

__init__(fin=None, fout=None, format=None, auto=True, extension=True, store=True, files=False, nofile=False, verbose=False, engine=None, layer=None, protocol=None, ip=False, ipv4=False, ipv6=False, tcp=False, strict=True, trace=False, trace_fout=None, trace_format=None, trace_byteorder='little', trace_nanosecond=False)

Initialise PCAP Reader.

Parameters
  • fin (Optional[str]) – name of the file to be read; if the file does not exist, FileNotFound is raised

  • fout (Optional[str]) – name of the file to be written

  • format (Optional[Formats]) – output file format

  • auto (bool) – whether to run automatically until EOF

  • extension (bool) – whether to check and append extensions to the output file name

  • store (bool) – whether to store extracted packet information

  • files (bool) – whether to split each frame into a separate file

  • nofile (bool) – whether to suppress dumping of any output file

  • verbose (bool | VerboseHandler) – a bool value, or a function that takes the Extractor instance and the currently parsed frame (whose type depends on the engine selected) as parameters and prints verbose output information

  • engine (Optional[Engines]) – extraction engine to be used

  • layer (Optional[Layers]) – the layer until which to extract

  • protocol (Optional[Protocols]) – the protocol until which to extract

  • ip (bool) – whether to record data for IPv4 & IPv6 reassembly

  • ipv4 (bool) – whether to perform IPv4 reassembly

  • ipv6 (bool) – whether to perform IPv6 reassembly

  • tcp (bool) – whether to perform TCP reassembly

  • strict (bool) – whether to set the strict flag for reassembly

  • trace (bool) – whether to trace TCP traffic flows

  • trace_fout (Optional[str]) – path name for the flow tracer, if necessary

  • trace_format (Optional[Formats]) – output file format of the flow tracer

  • trace_byteorder (Literal["big", "little"]) – byte order of the output file

  • trace_nanosecond (bool) – whether the output file uses nanosecond resolution

Warns

FormatWarning – Issued under the following circumstances:

  • If using PCAP output for TCP flow tracing while the extraction engine is PyShark.

  • If output file format is not supported.

Return type

None
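As a minimal usage sketch based on the signature above, the helper below counts the frames of a capture in non-auto mode without writing an output file. The helper name count_frames and its factory parameter are hypothetical, introduced only so that the sketch can be exercised without PyPCAPKit installed; by default it imports the documented Extractor class.

```python
def count_frames(path, factory=None):
    """Count frames in ``path`` by iterating an extractor (illustrative helper)."""
    if factory is None:
        # Imported lazily so the sketch can be loaded without PyPCAPKit installed.
        from pcapkit.foundation.extraction import Extractor as factory
    # auto=False leaves extraction to the iteration protocol; nofile=True
    # disables the output file; store=False skips keeping frames in memory.
    extractor = factory(fin=path, nofile=True, auto=False, store=False)
    count = 0
    for _frame in extractor:
        count += 1
    return count
```

Passing auto=False matters here, since iteration is documented as not applicable when the auto flag is set.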

__iter__()

Iterate and parse PCAP frame.

Raises

IterableError – If self._flag_a is True, as such operation is not applicable.

Return type

Extractor

__next__()

Iterate and parse next PCAP frame.

It calls _read_frame() internally to parse the next PCAP frame until EOF is reached, and then calls _cleanup() to clean up.

Return type

Frame | ScapyPacket | DPKTPacket
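The control flow described for __next__() can be sketched with a toy reader. The class FrameReader and its list-backed source are hypothetical stand-ins, not PyPCAPKit code; the sketch only shows frames being read until EOF, with cleanup running before iteration stops.

```python
class FrameReader:
    """Toy iterator: read frames until EOF, then clean up (illustrative only)."""

    def __init__(self, frames):
        self._source = list(frames)   # stand-in for an open PCAP file
        self._position = 0
        self.closed = False           # stand-in for the EOF flag

    def _read_frame(self):
        # Signal end of input the way a file-backed reader might.
        if self._position >= len(self._source):
            raise EOFError
        frame = self._source[self._position]
        self._position += 1
        return frame

    def _cleanup(self):
        self.closed = True

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return self._read_frame()
        except EOFError:
            # Clean up first, then end the iteration.
            self._cleanup()
            raise StopIteration from None
```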

__call__()

Works as a simple wrapper for the iteration protocol.

Raises

IterableError – If self._flag_a is True, as iteration is not applicable.

Return type

Frame | ScapyPacket | DPKTPacket

property info: VersionInfo

Version of input PCAP file.

Raises

UnsupportedCall – If self._exeng is 'scapy' or 'pyshark', as these engines do not preserve such information.

Return type

VersionInfo

property length: int

Frame number (of current extracted frame or all).

Return type

int

property format: Formats

Format of output file.

Raises

UnsupportedCall – If self._flag_q is set as True, as output is disabled by initialisation parameter.

Return type

Formats

property input: str

Name of input PCAP file.

Return type

str

property output: str

Name of output file.

Raises

UnsupportedCall – If self._flag_q is set as True, as output is disabled by initialisation parameter.

Return type

str

property header: Header

Global header.

Return type

Header

property frame: tuple[Frame, ...]

Extracted frames.

Raises

UnsupportedCall – If self._flag_d is False, as storing frame data is disabled.

property reassembly: ReassemblyData

Frame record for reassembly.

Return type

ReassemblyData

property trace: tuple[Index, ...]

Index table for traced flow.

Raises

UnsupportedCall – If self._flag_t is False, as TCP flow tracing is disabled.

Return type

tuple[Index, ...]

property engine: Engines

PCAP extraction engine.

Return type

Engines

run()

Start extraction.

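import_test() is used to check whether a certain engine is available. Each supported engine has its own driver method: the default engine reads the global header with record_header() and the packet frames with record_frames(), while the DPKT, Scapy and PyShark engines are driven by _run_dpkt(), _run_scapy() and _run_pyshark() respectively.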

Warns

EngineWarning – If the extraction engine is not available, either because its dependency is not installed or because the supplied engine is unknown.

Return type

None
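The availability check and fallback can be sketched as follows. How import_test() is implemented is not shown in this document, so the sketch substitutes the standard importlib.util.find_spec(); the function name pick_engine is hypothetical, and a plain UserWarning stands in for EngineWarning.

```python
import importlib.util
import warnings


def pick_engine(requested, default="default"):
    """Return ``requested`` if its module can be found, else ``default``."""
    if requested is None or requested == default:
        return default
    # find_spec returns None when a top-level module is not installed.
    if importlib.util.find_spec(requested) is None:
        warnings.warn(f"engine {requested!r} is not available; "
                      f"falling back to {default!r}", stacklevel=2)
        return default
    return requested
```

Falling back with a warning, rather than raising, matches the documented behaviour of warning when an engine is unavailable.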

record_header()

Read global header.

The method parses the PCAP global header and saves the parsed result as self._gbhdr. Information such as the PCAP version, data link layer protocol type, nanosecond flag and byte order is also saved to the current Extractor instance.

If TCP flow tracing is enabled, the nanosecond flag and byteorder will be used for the output PCAP file of the traced TCP flows.

For output, the method will dump the parsed PCAP global header under the name of Global Header.

Return type

None
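To make the header fields concrete, the sketch below parses the 24-byte global header of the classic PCAP file format with the standard struct module. It is an independent illustration of the file format, not PyPCAPKit's own parser; the function name parse_global_header and the returned dict layout are assumptions.

```python
import struct

# Magic numbers of the classic PCAP format, keyed by their on-disk bytes.
_MAGIC = {
    b"\xa1\xb2\xc3\xd4": (">", False),  # big-endian, microsecond timestamps
    b"\xd4\xc3\xb2\xa1": ("<", False),  # little-endian, microsecond timestamps
    b"\xa1\xb2\x3c\x4d": (">", True),   # big-endian, nanosecond timestamps
    b"\x4d\x3c\xb2\xa1": ("<", True),   # little-endian, nanosecond timestamps
}


def parse_global_header(data):
    """Parse the 24-byte PCAP global header into a plain dict."""
    if len(data) < 24:
        raise ValueError("PCAP global header requires 24 bytes")
    try:
        order, nanosecond = _MAGIC[bytes(data[:4])]
    except KeyError:
        raise ValueError("not a classic PCAP file") from None
    # Version (2 x uint16), time zone offset (int32), then three uint32 fields.
    major, minor, thiszone, sigfigs, snaplen, network = struct.unpack(
        order + "HHiIII", data[4:24])
    return {
        "byteorder": "big" if order == ">" else "little",
        "nanosecond": nanosecond,
        "version": (major, minor),
        "thiszone": thiszone,
        "sigfigs": sigfigs,
        "snaplen": snaplen,
        "network": network,  # link-layer type, e.g. 1 for Ethernet
    }
```

The magic number alone determines both the byte order and the timestamp resolution, which is why those two values can be reused for the traced-flow output.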

record_frames()

Read packet frames.

The method calls _read_frame() to parse each frame from the input PCAP file, and calls _cleanup() upon completion.

Notes

Under non-auto mode, i.e. self._flag_a is False, the method performs no action.

Return type

None

classmethod register(format, module, class_, ext)

Register a new dumper class.

Notes

The fully qualified name of the new dumper class should be {module}.{class_}.

Parameters
  • format (str) – format name

  • module (str) – module name

  • class_ (str) – class name

  • ext (str) – file extension

Return type

None
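The registration mechanism can be sketched as a class-level mapping from format name to a (module, class, extension) tuple. The class DumperRegistry, the helper qualified_name, and the example.* module names are all hypothetical; only the shape of the mapping and the {module}.{class_} naming follow the description here.

```python
import collections


class DumperRegistry:
    """Toy registry mapping a format name to (module, class, extension)."""

    # Unknown formats fall back to a hypothetical default entry.
    __output__ = collections.defaultdict(
        lambda: ("example.dumpers", "NullDumper", None),
        {"json": ("example.dumpers", "JSONDumper", "json")},
    )

    @classmethod
    def register(cls, format, module, class_, ext):
        # Later registrations for the same format replace earlier ones.
        cls.__output__[format] = (module, class_, ext)

    @classmethod
    def qualified_name(cls, format):
        module, class_, _ext = cls.__output__[format]
        return f"{module}.{class_}"
```

Storing names rather than class objects lets a dumper module be imported only when its format is actually requested.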

classmethod make_name(fin='in.pcap', fout='out', fmt='tree', extension=True, *, files=False, nofile=False)

Generate input and output filenames.

The method performs the following processing:

  1. sanitise fin as the input PCAP filename, using in.pcap as the default value; append the .pcap extension if needed and extension is True; and test whether the file exists;

  2. if nofile is True, skip the remaining steps;

  3. if fmt is provided, presume the corresponding output file extension;

  4. if fout is not provided, presume the output file name from the presumptive file extension, with out as the stem of the output file name; should the file extension be unavailable, raise FormatError;

  5. if fout is provided, presume the corresponding output format if needed; should the presumption be impossible, raise FormatError;

  6. append the corresponding file extension to the output file name if needed and extension is True.

Parameters
  • fin (str) – Input filename.

  • fout (str) – Output filename.

  • fmt (Formats) – Output file format.

  • extension (bool) – Whether to append the .pcap file extension to the input filename if fin lacks it, and whether to check and append extensions to the output file.

  • files (bool) – Whether to split each frame into different files.

  • nofile (bool) – Whether to suppress dumping of any output file.

Returns

Generated input and output filenames:

  1. input filename

  2. output filename / directory name

  3. output format

  4. output file extension (without .)

  5. whether to split each frame into different files

Raises
  • FileNotFound – If the input file does not exist.

  • FormatError – If the output format is not provided and cannot be presumed.
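The name resolution steps of make_name() can be sketched roughly as below. This is a simplified stand-in rather than the library's implementation: the function resolve_names, the _EXTENSIONS table and the injectable exists check are hypothetical, the defaults differ from the documented signature, per-frame file splitting is left out, and the built-in FileNotFoundError and ValueError stand in for FileNotFound and FormatError.

```python
import os

# Hypothetical format/extension table; the real mapping is not shown here.
_EXTENSIONS = {"json": "json", "plist": "plist", "tree": "txt"}


def resolve_names(fin="in.pcap", fout=None, fmt=None, extension=True, *,
                  nofile=False, exists=os.path.isfile):
    """Resolve input/output names roughly following the documented steps."""
    # Step 1: normalise the input name and check that the file exists.
    if extension and not fin.endswith(".pcap"):
        fin = f"{fin}.pcap"
    if not exists(fin):
        raise FileNotFoundError(fin)
    # Step 2: with no output file there is nothing more to resolve.
    if nofile:
        return fin, None, None, None
    # Step 3: a known format implies its file extension.
    ext = _EXTENSIONS.get(fmt)
    if fout is None:
        # Step 4: the default stem "out" needs an extension from the format.
        if ext is None:
            raise ValueError("output format not provided")
        return fin, f"out.{ext}", fmt, ext
    if ext is None:
        # Step 5: otherwise infer the format from the output file suffix.
        suffix = os.path.splitext(fout)[1].lstrip(".")
        matches = [name for name, value in _EXTENSIONS.items() if value == suffix]
        if not matches:
            raise ValueError("output format cannot be inferred")
        fmt, ext = matches[0], suffix
    # Step 6: append the extension when it is missing.
    if extension and not fout.endswith(f".{ext}"):
        fout = f"{fout}.{ext}"
    return fin, fout, fmt, ext
```

The exists parameter is only there so that the sketch can be tried without real capture files on disk.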

_read_frame()

Headquarters for frame reader.

This method is a dispatcher for parsing frames.

Return type

Frame | ScapyPacket | DPKTPacket

Returns

The parsed frame instance.
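One way such a dispatcher might be written is shown below. The class EngineDispatcher is a hypothetical toy; the getattr() lookup and the fallback to the default reader for unknown engines are assumptions of this sketch, not necessarily how _read_frame() is implemented.

```python
class EngineDispatcher:
    """Toy dispatcher choosing a frame reader by engine name (illustrative)."""

    def __init__(self, engine="default"):
        self._exeng = engine

    def _default_read_frame(self):
        return "frame from default engine"

    def _scapy_read_frame(self):
        return "frame from scapy engine"

    def _read_frame(self):
        # Look up an engine-specific reader, falling back to the default one.
        reader = getattr(self, f"_{self._exeng}_read_frame",
                         self._default_read_frame)
        return reader()
```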

_cleanup()

Cleanup after extraction & analysis.

The method clears the self._expkg and self._extmp attributes, sets self._flag_e as True and closes the input file.

Return type

None

_default_read_frame()

Read frames with default engine.

This method performs the following operations:

  • extract frames and each layer of packets;

  • make an Info object out of frame properties;

  • write to the output file with the corresponding dumper;

  • reassemble IP and/or TCP datagrams;

  • trace TCP flows, if any;

  • record the frame Info object to frame storage.

Return type

Frame

Returns

Parsed frame instance.

_run_scapy(scapy_all)

Call scapy.all.sniff() to extract PCAP files.

This method assigns self._expkg as scapy.all and self._extmp as an iterator from scapy.all.sniff().

Parameters

scapy_all (module) – The scapy.all module.

Warns

AttributeWarning – If self._exlyr and/or self._exptl is provided, as the Scapy engine currently does not support such operations.

Return type

None

_scapy_read_frame()

Read frames with Scapy engine.

Return type

ScapyPacket

Returns

Parsed frame instance.

See also

Please refer to _default_read_frame() for more operational information.

_run_dpkt(dpkt)

Call dpkt.pcap.Reader to extract PCAP files.

This method assigns self._expkg as dpkt and self._extmp as an iterator from dpkt.pcap.Reader.

Parameters

dpkt (module) – The dpkt module.

Warns

AttributeWarning – If self._exlyr and/or self._exptl is provided, as the DPKT engine currently does not support such operations.

Return type

None

_dpkt_read_frame()

Read frames with DPKT engine.

Returns

Parsed frame instance.

Return type

dpkt.dpkt.Packet

See also

Please refer to _default_read_frame() for more operational information.

_run_pyshark(pyshark)

Call pyshark.FileCapture to extract PCAP files.

This method assigns self._expkg as pyshark and self._extmp as an iterator from pyshark.FileCapture.

Parameters

pyshark (types.ModuleType) – The pyshark module.

Warns

AttributeWarning – Issued under the following circumstances:

  • if self._exlyr and/or self._exptl is provided, as the PyShark engine currently does not support such operations;

  • if reassembly is enabled, as the PyShark engine currently does not support such an operation.

Return type

None

_pyshark_read_frame()

Read frames with PyShark engine.

Return type

PySharkPacket

Returns

Parsed frame instance.

See also

Please refer to _default_read_frame() for more operational information.

_flag_a: bool

Auto extract flag.

_flag_d: bool

Store data flag.

_flag_e: bool

EOF flag.

_flag_q: bool

No output file.

_flag_t: bool

Trace flag.

_exptl: Protocols

Protocol until which to extract.

_exlyr: Layers

Layer until which to extract.

_exeng: Engines

Extract using engine.

_expkg: Any

Extract module instance.

_extmp: Any

Extract iterator instance.

_gbhdr: Header

Global header.

__output__: DefaultDict[str, tuple[str, str, str | None]]

Format dumper mapping for writing output files. The values should be a tuple representing the module name, class name and file extension.

Type

DefaultDict[str, tuple[str, str, str | None]]

Data Structures

class pcapkit.foundation.extraction.ReassemblyData(ipv4, ipv6, tcp)

Bases: Info

Data storage for reassembly.

Parameters
  • *args (VT) – Arbitrary positional arguments.

  • **kwargs (VT) – Arbitrary keyword arguments.

Return type

Info

ipv4: Optional[tuple[IP_Datagram, ...]]

IPv4 reassembled data.

ipv6: Optional[tuple[IP_Datagram, ...]]

IPv6 reassembled data.

tcp: Optional[tuple[TCP_Datagram, ...]]

TCP reassembled data.