Extractor for PCAP Files

pcapkit.foundation.extraction contains only Extractor, which synthesises file I/O and protocol analysis, coordinates information exchange across all network layers, and extracts parameters from a PCAP file.

Todo

Implement engine support for pypcap & pycapfile.

class pcapkit.foundation.extraction.Extractor(fin=None, fout=None, format=None, auto=True, extension=True, store=True, files=False, nofile=False, verbose=False, engine=None, layer=None, protocol=None, ip=False, ipv4=False, ipv6=False, tcp=False, strict=True, trace=False, trace_fout=None, trace_format=None, trace_byteorder='little', trace_nanosecond=False)[source]

Bases: object

Extractor for PCAP files.

For supported engines, please refer to corresponding driver method for more information:

_ifnm: str

Input file name.

_ofnm: str

Output file name.

_fext: str

Output file extension.

_flag_a: bool

Auto extraction flag (as the auto parameter).

_flag_d: bool

Data storing flag (as the store parameter).

_flag_e: bool

EOF flag.

_flag_f: bool

Split output into files flag (as the files parameter).

_flag_m: bool

Multiprocessing engine flag.

_flag_q: bool

No output flag (as the nofile parameter).

_flag_t: bool

TCP flow tracing flag (as the trace parameter).

_flag_v: Union[bool, Callable[[pcapkit.foundation.extraction.Extractor, pcapkit.protocols.pcap.frame.Frame]]]

A bool value, or a function that takes the Extractor instance and the current parsed frame (whose type depends on the engine selected) as parameters and prints verbose output information (as the verbose parameter).

_vfunc: Union[NotImplemented, Callable[[pcapkit.foundation.extraction.Extractor, pcapkit.protocols.pcap.frame.Frame]]]

If the verbose parameter is a callable, it is assigned to self._vfunc; otherwise, self._vfunc keeps NotImplemented as a placeholder and each engine uses its own specific function.

_frnum: int

Current frame number.

_frame: List[pcapkit.protocols.pcap.frame.Frame]

Frame records storage.

_proto: pcapkit.corekit.protochain.ProtoChain

Current frame’s protocol chain.

_reasm: List[Optional[pcapkit.reassembly.ipv4.IPv4_Reassembly], Optional[pcapkit.reassembly.ipv6.IPv6_Reassembly], Optional[pcapkit.reassembly.tcp.TCP_Reassembly]]

Reassembly buffers.

_trace: Optional[pcapkit.foundation.traceflow.TraceFlow]

TCP flow tracer.

_ipv4: bool

IPv4 reassembly flag (as the ipv4 and/or ip flag).

_ipv6: bool

IPv6 reassembly flag (as the ipv6 and/or ip flag).

_tcp: bool

TCP reassembly flag (as the tcp parameter).

_exptl: str

Protocol until which to extract (as the protocol parameter).

_exlyr: str

Layer until which to extract (as the layer parameter).

_exeng: str

Extraction engine (as the engine parameter).

_ifile: io.BufferedReader

Source PCAP file (opened in binary mode).

_ofile: Optional[Union[dictdumper.dumper.Dumper, Type[dictdumper.dumper.Dumper]]]

Output dumper. If self._flag_f is True, it is the Dumper class itself; otherwise, it is an initialised Dumper instance.

Note

We customised the object_hook() method to provide generic support of enum.Enum, ipaddress.IPv4Address, ipaddress.IPv6Address and Info.
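As a rough illustration of what such a hook has to handle, the sketch below converts the same kinds of values into JSON-friendly ones. It is not the dictdumper implementation: it uses the default callback of the standard library json module instead of object_hook(), and the names to_serialisable and LinkType are invented for the example.

```python
import enum
import ipaddress
import json


class LinkType(enum.Enum):
    """Hypothetical enumeration standing in for a protocol registry."""
    ETHERNET = 1


def to_serialisable(obj):
    """Convert values that json cannot encode natively."""
    if isinstance(obj, enum.Enum):
        return obj.name
    if isinstance(obj, (ipaddress.IPv4Address, ipaddress.IPv6Address)):
        return str(obj)
    raise TypeError(f'unsupported type: {type(obj).__name__}')


record = {
    'link': LinkType.ETHERNET,
    'src': ipaddress.IPv4Address('192.0.2.1'),
    'dst': ipaddress.IPv6Address('2001:db8::1'),
}
# json calls to_serialisable only for objects it cannot encode itself.
text = json.dumps(record, default=to_serialisable, sort_keys=True)
```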

See also

When the output format is unsupported, we use NotImplementedIO as a fallback solution.

_gbhdr: pcapkit.protocols.pcap.header.Header

Parsed PCAP global header instance.

_vinfo: pcapkit.corekit.version.VersionInfo

The version info of the PCAP file (as the self._gbhdr.version property).

Protocol type of data link layer (as the self._gbhdr.protocol property).

_nnsec: bool

Nanosecond PCAP file flag (as the self._gbhdr.nanosecond property).

_type: str

Output format (as the self._ofile.kind property).

_expkg: types.ModuleType

Extraction engine module.

_extmp: Iterator[Any]

Temporary storage for frame parsing iterator.

_mpprc: List[multiprocessing.Process]

List of active child processes.

_mpfdp: DefaultDict[multiprocessing.Queue]

File pointer (offset) queue for each frame.

_mpmng: multiprocessing.sharedctypes.multiprocessing.Manager

Multiprocessing manager context.

_mpkit: multiprocessing.managers.SyncManager.Namespace

Multiprocessing utility namespace.

_mpkit.counter: int

Number of active workers.

_mpkit.pool: int

Number of prepared workers.

_mpkit.current: int

Current processing frame number.

_mpkit.eof: bool

EOF flag.

_mpkit.frames: Dict[int, pcapkit.protocols.pcap.frame.Frame]

Frame storage.

_mpkit.trace: Optional[pcapkit.foundation.traceflow.TraceFlow]

TCP flow tracer.

_mpkit.reassembly: List[Optional[pcapkit.reassembly.ipv4.IPv4_Reassembly], Optional[pcapkit.reassembly.ipv6.IPv6_Reassembly], Optional[pcapkit.reassembly.tcp.TCP_Reassembly]]

Reassembly buffers.

_mpsrv: multiprocessing.Process

Server process for frame analysis and processing.

_mpbuf: Union[multiprocessing.managers.SyncManager.dict, Dict[int, pcapkit.protocols.pcap.frame.Frame]]

Multiprocessing buffer for parsed PCAP frames.

_mpfrm: Union[multiprocessing.managers.SyncManager.list, List[pcapkit.protocols.pcap.frame.Frame]]

Multiprocessing storage for processed PCAP frames.

_mprsm: Union[multiprocessing.managers.SyncManager.list, List[Optional[pcapkit.reassembly.ipv4.IPv4_Reassembly], Optional[pcapkit.reassembly.ipv6.IPv6_Reassembly], Optional[pcapkit.reassembly.tcp.TCP_Reassembly]]]

Multiprocessing storage for reassembly buffers.

__call__()[source]

Works as a simple wrapper for the iteration protocol.

Raises

IterableError – If self._flag_a is True, as iteration is not applicable.

__enter__()[source]

Uses Extractor as a context manager.

__exit__(exc_type, exc_value, traceback)[source]

Closes the input file on exit.

__init__(fin=None, fout=None, format=None, auto=True, extension=True, store=True, files=False, nofile=False, verbose=False, engine=None, layer=None, protocol=None, ip=False, ipv4=False, ipv6=False, tcp=False, strict=True, trace=False, trace_fout=None, trace_format=None, trace_byteorder='little', trace_nanosecond=False)[source]

Initialise PCAP Reader.

Parameters
  • fin (Optional[str]) – file name to be read; if the file does not exist, raises FileNotFound

  • fout (Optional[str]) – file name to be written

  • format (Optional[Literal['plist', 'json', 'tree']]) – file format of output

  • auto (bool) – whether to run automatically until EOF

  • extension (bool) – whether to check and append extensions to the output file

  • store (bool) – whether to store extracted packet info

  • files (bool) – whether to split each frame into a separate file

  • nofile (bool) – whether to skip dumping an output file

  • verbose (Union[bool, Callable[[pcapkit.foundation.extraction.Extractor, pcapkit.protocols.pcap.frame.Frame]]]) – a bool value, or a function that takes the Extractor instance and the current parsed frame (whose type depends on the engine selected) as parameters and prints verbose output information

  • engine (Optional[Literal['default', 'pcapkit', 'dpkt', 'scapy', 'pyshark', 'server', 'pipeline']]) – extraction engine to be used

  • layer (Optional[Literal['Link', 'Internet', 'Transport', 'Application']]) – layer until which to extract

  • protocol (Optional[Union[str, Tuple[str], Type[Protocol]]]) – protocol until which to extract

  • ip (bool) – whether to record data for IPv4 & IPv6 reassembly

  • ipv4 (bool) – whether to perform IPv4 reassembly

  • ipv6 (bool) – whether to perform IPv6 reassembly

  • tcp (bool) – whether to perform TCP reassembly

  • strict (bool) – whether to set the strict flag for reassembly

  • trace (bool) – whether to trace TCP traffic flows

  • trace_fout (Optional[str]) – path name for flow tracer if necessary

  • trace_format (Optional[Literal['plist', 'json', 'tree', 'pcap']]) – output file format of flow tracer

  • trace_byteorder (Literal['little', 'big']) – output file byte order

  • trace_nanosecond (bool) – output nanosecond-resolution file flag

Warns

FormatWarning – Warns under following circumstances:

  • If using PCAP output for TCP flow tracing while the extraction engine is PyShark.

  • If output file format is not supported.
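A minimal usage sketch based only on the signature documented above. It assumes pcapkit is installed and that path points to a readable PCAP file; the helper name count_frames is invented for the example, and behaviour may differ between pcapkit versions.

```python
def count_frames(path):
    """Return the number of frames stored after extracting ``path``."""
    # Imported lazily so that this module loads without pcapkit installed.
    from pcapkit.foundation.extraction import Extractor

    # nofile=True suppresses the output file; store=True keeps parsed frames;
    # auto=True (the default) runs the extraction until EOF.
    extractor = Extractor(fin=path, nofile=True, store=True)
    return len(extractor.frame)
```

Because auto is left at its default, the instance is not meant to be iterated; the stored frames are read back through the frame property instead.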

__iter__()[source]

Iterate and parse PCAP frame.

Raises

IterableError – If self._flag_a is True, as such operation is not applicable.

__next__()[source]

Iterate and parse next PCAP frame.

It calls _read_frame() internally to parse the next PCAP frame until EOF is reached, and then calls _cleanup() to finish up.
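The read-until-EOF-then-clean-up pattern can be sketched with a toy reader. This is not the Extractor code: FrameReader and its fixed-size chunks are assumptions made for the example, and TypeError stands in for IterableError.

```python
import io


class FrameReader:
    """Toy reader: yields fixed-size chunks until EOF, then cleans up."""

    def __init__(self, stream, size, auto=False):
        self._stream = stream
        self._size = size
        self._auto = auto      # plays the role of the auto flag
        self._eof = False      # plays the role of the EOF flag

    def __iter__(self):
        if self._auto:
            # Hypothetical stand-in for IterableError.
            raise TypeError('reader in auto mode is not iterable')
        return self

    def __next__(self):
        if self._eof:
            raise StopIteration
        chunk = self._stream.read(self._size)
        if len(chunk) < self._size:
            # Nothing (or only a partial chunk) left: finish up and stop.
            self._cleanup()
            raise StopIteration
        return chunk

    def _cleanup(self):
        self._eof = True
        self._stream.close()


reader = FrameReader(io.BytesIO(b'aabbcc'), size=2)
chunks = list(reader)
```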

_aftermathmp()[source]

Aftermath for multiprocessing.

The method will join all child processes forked/spawned as in self._mpprc, and will join self._mpsrv server process if using multiprocessing server engine.

For multiprocessing server engine, it will

For multiprocessing pipeline engine, it will

After restoring attributes, it shuts down the multiprocessing manager context self._mpmng, deletes all multiprocessing attributes (i.e. those whose names start with _mp), and decrements the frame number self._frnum by 2 (a workaround).

Notes

If self._flag_e is already set as True, do nothing.

Raises

UnsupportedCall – If self._flag_m is False, as such operation is not applicable.

_cleanup()[source]

Cleanup after extraction & analysis.

The method clears the self._expkg and self._extmp attributes, sets self._flag_e as True and closes the input file.

_default_read_frame(*, frame=None, mpkit=None)[source]

Read frames with default engine.

This method performs following operations:

  • extract frames and each layer of packets;

  • make Info object out of frame properties;

  • write to output file with corresponding dumper;

  • reassemble IP and/or TCP datagram;

  • trace TCP flows if any;

  • record frame Info object to frame storage.

Keyword Arguments
Returns

Parsed frame instance.

Return type

Optional[pcapkit.protocols.pcap.frame.Frame]

_dpkt_read_frame()[source]

Read frames with DPKT engine.

Returns

Parsed frame instance.

Return type

dpkt.dpkt.Packet

See also

Please refer to _default_read_frame() for more operational information.

_pipeline_read_frame(*, mpfdp, mpkit)[source]

Extract frame with multiprocessing pipeline engine.

The method calls Frame to parse the PCAP frame data. Should EOFError be raised, it sets self._mpkit.eof to True. Finally, it decrements self.mpkit.counter by 1 and closes the input source file (as the child process exits).

For the parsed Frame instance, the method first waits until self.mpkit.current equals self._frnum, i.e. until it is time to process the parsed frame in linear sequential order.

It then proceeds by calling _default_read_frame(), temporarily assigning self.mpkit.trace to self._trace and self.mpkit.reassembly to self._reasm, and putting them back afterwards.
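The wait-for-your-turn step can be sketched as follows. This is an illustration under simplifying assumptions: it uses threads and a threading.Condition rather than child processes and a manager namespace, frame numbers are assumed to be contiguous starting at 1, and process_in_order is an invented name.

```python
import threading


def process_in_order(frame_numbers):
    """Run one worker per frame; analysis happens strictly in frame order."""
    state = {'current': 1}           # plays the role of mpkit.current
    cond = threading.Condition()
    analysed = []

    def worker(frnum):
        parsed = f'frame-{frnum}'    # parsing may finish in any order
        with cond:
            # Wait until it is this frame's turn.
            cond.wait_for(lambda: state['current'] == frnum)
            analysed.append(parsed)  # sequential part (reassembly, tracing)
            state['current'] += 1
            cond.notify_all()

    threads = [threading.Thread(target=worker, args=(n,)) for n in frame_numbers]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return analysed


ordered = process_in_order([3, 1, 2])
```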

Keyword Arguments
Raises

EOFError – If self._flag_e is True, as the parsing had finished.

_pyshark_read_frame()[source]

Read frames with PyShark engine.

Returns

Parsed frame instance.

Return type

pyshark.packet.packet.Packet

Notes

This method attaches packet2dict() to the parsed frame instance as its packet2dict() method.

See also

Please refer to _default_read_frame() for more operational information.

_read_frame()[source]

Headquarters for frame reader.

This method is a dispatcher for parsing frames.

Returns

The parsed frame instance.
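The source does not say how the dispatch is implemented; one possible shape, shown purely as a sketch, is a lookup keyed on the engine name. The Reader class and its return values are invented for the example.

```python
class Reader:
    """Toy dispatcher keyed on the engine name."""

    def __init__(self, engine='default'):
        self._exeng = engine

    def _default_read_frame(self):
        return 'default frame'

    def _dpkt_read_frame(self):
        return 'dpkt frame'

    def _read_frame(self):
        # Look up the engine-specific reader, falling back to the default.
        method = getattr(self, f'_{self._exeng}_read_frame', self._default_read_frame)
        return method()
```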

_run_dpkt(dpkt)[source]

Call dpkt.pcap.Reader to extract PCAP files.

This method assigns self._expkg as dpkt and self._extmp as an iterator from dpkt.pcap.Reader.

Parameters

dpkt (types.ModuleType) – The dpkt module.

Warns

AttributeWarning – If self._exlyr and/or self._exptl is provided, as the DPKT engine currently does not support such operations.

_run_pipeline(multiprocessing)[source]

Use pipeline multiprocessing to extract PCAP files.

Notes

The basic concept of the multiprocessing pipeline engine is that we parse the PCAP file as a pipeline, with one frame per worker. Once the length of a frame is known, i.e. once the PCAP frame header is parsed, we can start a new worker and begin parsing the next frame concurrently.

However, as the datagram reassembly and TCP flow tracing require linear sequential processing, we still need to wait for the completion of analysis on previous frames before proceeding on such operations.

This method assigns self._expkg as multiprocessing, creates a file pointer storage as self._mpfdp, manager context as self._mpmng and namespace as self._mpkit.

In the namespace, we initialise the number of (on duty) workers as counter, the pool of (ready) workers as pool, the current frame number as current, the EOF flag as eof, the frame storage as frames, the TCP flow tracer self._trace as trace and the reassembly buffers self._reasm as reassembly.

After initial setup, the method calls record_header() to parse the PCAP global header and puts the file offset into self._mpfdp as the start of the first frame. Then it starts parsing each PCAP frame.

During this phase, the method runs a while loop until self._mpkit.eof is set to True, at which point it calls _update_eof() and breaks. Within the loop, it maintains a worker pool similar to multiprocessing.pool.Pool, checking self._mpkit.pool for available workers and self._mpkit.counter for active workers.

When starting a new worker, it first updates the input file offset to the one specified in self._mpfdp. It then creates a child process running _pipeline_read_frame() with keyword arguments mpkit as self._mpkit and mpfdp as the corresponding Queue from self._mpfdp. Afterwards, it decrements self._mpkit.pool and increments self._mpkit.counter, both by 1. The child process is appended to self._mpprc.

When the number of active workers is greater than or equal to CPU_CNT, it waits for and joins the leading child processes in self._mpprc, then removes their references.
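The reason a new worker can start as soon as a frame header is parsed is that the header already fixes where the next frame begins. The sketch below walks those offsets sequentially with the standard struct module; it relies on the classic PCAP layout (24-byte global header, 16-byte record header) and is not the engine's own code. frame_offsets and make_record are invented names.

```python
import struct

GLOBAL_HEADER_LEN = 24   # size of the PCAP global header
RECORD_HEADER_LEN = 16   # ts_sec, ts_usec, incl_len, orig_len


def frame_offsets(data, byteorder='<'):
    """Return the file offset at which each frame record starts."""
    offsets = []
    offset = GLOBAL_HEADER_LEN
    while offset + RECORD_HEADER_LEN <= len(data):
        _, _, incl_len, _ = struct.unpack_from(byteorder + 'IIII', data, offset)
        offsets.append(offset)
        # The next record starts right after this record's captured bytes.
        offset += RECORD_HEADER_LEN + incl_len
    return offsets


def make_record(payload):
    """Build a little-endian record with zeroed timestamps (demo helper)."""
    return struct.pack('<IIII', 0, 0, len(payload), len(payload)) + payload


capture = bytes(GLOBAL_HEADER_LEN) + make_record(b'abc') + make_record(b'hello')
offsets = frame_offsets(capture)
```

In a pipeline, each offset would be handed to the next worker through a queue instead of being collected in a list.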

Parameters

multiprocessing (types.ModuleType) – The multiprocessing module.

Warns

AttributeWarning – If self._flag_q is False, as multiprocessing engines do not support output.

Raises

UnsupportedCall – If self._flag_m is False, as such operation is not applicable.

_run_pyshark(pyshark)[source]

Call pyshark.FileCapture to extract PCAP files.

This method assigns self._expkg as pyshark and self._extmp as an iterator from pyshark.FileCapture.

Parameters

pyshark (types.ModuleType) – The pyshark module.

Warns

AttributeWarning – Warns under following circumstances:

  • if self._exlyr and/or self._exptl is provided, as the PyShark engine currently does not support such operations.

  • if reassembly is enabled, as the PyShark engine currently does not support such operations.

_run_scapy(scapy_all)[source]

Call scapy.all.sniff() to extract PCAP files.

This method assigns self._expkg as scapy.all and self._extmp as an iterator from scapy.all.sniff().

Parameters

scapy_all (types.ModuleType) – The scapy.all module.

Warns

AttributeWarning – If self._exlyr and/or self._exptl is provided, as the Scapy engine currently does not support such operations.

_run_server(multiprocessing)[source]

Use server multiprocessing to extract PCAP files.

Notes

The basic concept of the multiprocessing server engine is that we further separate the logic of PCAP frame parsing from analysis/processing, compared to the multiprocessing pipeline engine (c.f. _run_pipeline()).

We start a server process to perform the datagram reassembly, TCP flow tracing, etc. of all parsed PCAP frames, whilst parsing each PCAP frame in the same manner as in the multiprocessing pipeline engine, i.e. one frame per worker.

This method assigns self._expkg as multiprocessing, creates a file pointer storage as self._mpfdp, manager context as self._mpmng and namespace as self._mpkit. We will also maintain the active process list self._mpprc as in _run_pipeline().

It also creates a dict as self._mpbuf, a frame buffer (temporary storage) from which the server process obtains the parsed frames; a list as self._mpfrm, the eventual frame storage; and another list as self._mprsm, which stores the reassembly buffers self._reasm before the server process exits.

In the namespace, we initialise the number of (on duty) workers as counter, the pool of (ready) workers as pool, the current frame number as current, the EOF flag as eof, the frame storage as frames, and trace for storing the TCP flow tracer self._trace before the server process exits.

After initial setup, the method calls record_header() to parse the PCAP global header and puts the file offset into self._mpfdp as the start of the first frame. It then starts the server process self._mpsrv from _server_analyse_frame(). Finally, it starts parsing each PCAP frame.

During this phase, the method runs a while loop until self._mpkit.eof is set to True, at which point it calls _update_eof() and breaks. Within the loop, it maintains a worker pool similar to multiprocessing.pool.Pool, checking self._mpkit.pool for available workers and self._mpkit.counter for active workers.

When starting a new worker, it first updates the input file offset to the one specified in self._mpfdp. It then creates a child process running _server_extract_frame() with keyword arguments mpkit as self._mpkit, mpbuf as self._mpbuf and mpfdp as the corresponding Queue from self._mpfdp. Afterwards, it decrements self._mpkit.pool and increments self._mpkit.counter, both by 1. The child process is appended to self._mpprc.

When the number of active workers is greater than or equal to CPU_CNT, it waits for and joins the leading child processes in self._mpprc, then removes their references.

Parameters

multiprocessing (types.ModuleType) – The multiprocessing module.

Warns

AttributeWarning – If self._flag_q is False, as multiprocessing engines do not support output.

Raises

UnsupportedCall – If self._flag_m is False, as such operation is not applicable.

_scapy_read_frame()[source]

Read frames with Scapy engine.

Returns

Parsed frame instance.

Return type

scapy.packet.Packet

See also

Please refer to _default_read_frame() for more operational information.

_server_analyse_frame(*, mpkit, mpfrm, mprsm, mpbuf)[source]

Analyse frame using multiprocessing server engine.

This method runs a while loop. In each round, it pops frame number self._frnum from mpbuf and then calls _default_read_frame() to perform datagram reassembly, TCP flow tracing, etc.

Once the popped frame is EOFError, i.e. the frame parsing has finished, it breaks from the loop and updates mpfrm with self._frame, mprsm with self._reasm, and mpkit.trace with self._trace.
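A simplified sketch of such a consumer loop, assuming a plain dict filled by threads in place of the manager dict filled by child processes. The analyse function, the polling interval and the use of the EOFError class as the sentinel value are assumptions of this example.

```python
import threading
import time


def analyse(buffer, poll=0.001):
    """Consume frames from ``buffer`` in frame-number order until EOF."""
    frames = []
    frnum = 1
    while True:
        if frnum not in buffer:
            time.sleep(poll)        # the frame has not been parsed yet
            continue
        frame = buffer.pop(frnum)
        if frame is EOFError:       # sentinel: parsing has finished
            break
        frames.append(frame)        # sequential analysis would happen here
        frnum += 1
    return frames


buffer = {}
# Workers deliver their results out of order on purpose.
workers = [
    threading.Thread(target=buffer.__setitem__, args=(number, value))
    for number, value in [(3, EOFError), (2, 'second'), (1, 'first')]
]
for worker in workers:
    worker.start()
result = analyse(buffer)
for worker in workers:
    worker.join()
```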

Keyword Arguments
_server_extract_frame(*, mpfdp, mpkit, mpbuf)[source]

Extract frame using multiprocessing server engine.

The method calls Frame to parse the PCAP frame data. The parsed frame will be saved to mpbuf under the corresponding frame number self._frnum.

Should EOFError be raised, it sets self._mpkit.eof to True and saves the EOFError object to mpbuf under the corresponding frame number self._frnum.

Finally, it decrements self.mpkit.counter by 1 and closes the input source file (as the child process exits).

Parameters
  • mpfdp (multiprocessing.Queue) – Queue for multiprocessing file pointer (offset).

  • mpkit (multiprocessing.managers.SyncManager.Namespace) – Namespace instance as _mpkit.

  • mpbuf (multiprocessing.managers.SyncManager.dict) – Frame buffer (temporary storage) for the server process self._mpsrv to obtain the parsed frames.

Raises

EOFError – If self._flag_e is True, as the parsing had finished.

_update_eof()[source]

Update EOF flag.

This method calls _aftermathmp() to clean up multiprocessing resources, closes the input file and sets self._flag_e to True.

check()[source]

Check layer and protocol thresholds.

Warns

See also

  • List of available layers: LAYER_LIST

  • List of available protocols: PROTO_LIST

static import_test(engine, *, name=None)[source]

Test import for extraction engine.

Parameters

engine (str) – Extraction engine module name.

Keyword Arguments

name (Optional[str]) – Extraction engine display name.

Warns

EngineWarning – If the engine module is not installed.

Returns

If succeeded, returns True and the module; otherwise, returns False and None.

Return type

Tuple[bool, Optional[ModuleType]]
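The documented contract (a flag plus the module, with a warning on failure) can be sketched with importlib. This is an illustrative reimplementation, not the library's code: UserWarning stands in for EngineWarning, and the warning text is invented.

```python
import importlib
import warnings


def import_test(engine, *, name=None):
    """Try to import ``engine``; return ``(True, module)`` or ``(False, None)``."""
    try:
        module = importlib.import_module(engine)
    except ImportError:
        # The documented method warns with EngineWarning; UserWarning stands in here.
        warnings.warn(f"extraction engine '{name or engine}' not available", UserWarning)
        return False, None
    return True, module


flag, module = import_test('json')
```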

classmethod make_name(fin, fout, fmt, extension, *, files=False, nofile=False)[source]

Generate input and output filenames.

The method will perform following processing:

  1. sanitise fin as the input PCAP filename, using in.pcap as the default value and appending the .pcap extension if needed and extension is True, then test whether the file exists;

  2. if nofile is True, skip the following processing;

  3. if fmt is provided, presume the corresponding output file extension;

  4. if fout is not provided, presume the output file name based on the presumptive file extension, with out as the stem of the output file name; should the file extension not be available, raise FormatError;

  5. if fout is provided, presume the corresponding output format if needed; should the presumption not be possible, raise FormatError;

  6. append the corresponding file extension to the output file name if needed and extension is True.
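The steps above can be sketched as a simplified function. It skips the file-existence check and the files handling, returns a four-element tuple rather than the documented five, uses ValueError in place of FormatError, and relies on a hypothetical EXTENSIONS mapping, so it should be read as an outline of the logic only.

```python
import os

# Hypothetical mapping from output format to file extension.
EXTENSIONS = {'plist': 'plist', 'json': 'json', 'tree': 'txt'}


def make_name(fin=None, fout=None, fmt=None, extension=True, *, nofile=False):
    """Simplified filename derivation; skips the file-existence check."""
    ifnm = fin or 'in.pcap'
    if extension and os.path.splitext(ifnm)[1] != '.pcap':
        ifnm += '.pcap'
    if nofile:
        return ifnm, None, None, None

    ext = EXTENSIONS.get(fmt)
    if fout is None:
        if ext is None:
            raise ValueError('output format not provided')  # stands in for FormatError
        ofnm = 'out'
    else:
        ofnm = fout
        if ext is None:
            # Presume the format from the output file's own extension.
            suffix = os.path.splitext(ofnm)[1].lstrip('.')
            matches = [key for key, value in EXTENSIONS.items() if value == suffix]
            if not matches:
                raise ValueError('output format cannot be presumed')
            fmt, ext = matches[0], suffix
    if extension and os.path.splitext(ofnm)[1] != f'.{ext}':
        ofnm += f'.{ext}'
    return ifnm, ofnm, fmt, ext


names = make_name('capture', fmt='json')
```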

Parameters
  • fin (Optional[str]) – Input filename.

  • fout (Optional[str]) – Output filename.

  • fmt (str) – Output file format.

  • extension (bool) – Whether to append the .pcap file extension to the input filename if fin lacks it, and whether to check and append extensions to the output file.

Keyword Arguments
  • files (bool) – Whether to split each frame into a separate file.

  • nofile (bool) – Whether to skip dumping an output file.

Returns

Generated input and output filenames:

  1. input filename

  2. output filename / directory name

  3. output format

  4. output file extension (without .)

  5. whether each frame is split into a separate file

Return type

Tuple[str, str, str, str, bool]

Raises
  • FileNotFound – If the input file does not exist.

  • FormatError – If the output format is not provided and cannot be presumed.

record_frames()[source]

Read packet frames.

The method calls _read_frame() to parse each frame from the input PCAP file, and calls _cleanup() upon completion.

Notes

Under non-auto mode, i.e. self._flag_a is False, the method performs no action.

record_header()[source]

Read global header.

The method parses the PCAP global header and saves the parsed result as self._gbhdr. Information such as the PCAP version, data link layer protocol type, nanosecond flag and byteorder is also saved on the current Extractor instance.

If TCP flow tracing is enabled, the nanosecond flag and byteorder will be used for the output PCAP file of the traced TCP flows.

For output, the method will dump the parsed PCAP global header under the name of Global Header.
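For orientation, the fields mentioned above can be read from the 24-byte global header with the standard struct module. The sketch follows the classic PCAP file layout and is independent of pcapkit's Header class; parse_global_header and the returned dict keys are invented for the example.

```python
import struct

MAGIC_MICRO = 0xA1B2C3D4   # microsecond-resolution capture
MAGIC_NANO = 0xA1B23C4D    # nanosecond-resolution capture


def parse_global_header(data):
    """Parse the 24-byte PCAP global header into a plain dict."""
    # The magic number reads correctly in only one byte order.
    for byteorder, prefix in (('little', '<'), ('big', '>')):
        magic = struct.unpack_from(prefix + 'I', data, 0)[0]
        if magic in (MAGIC_MICRO, MAGIC_NANO):
            break
    else:
        raise ValueError('not a PCAP file')
    major, minor, _zone, _sigfigs, snaplen, network = struct.unpack_from(
        prefix + 'HHiIII', data, 4)
    return {
        'byteorder': byteorder,
        'nanosecond': magic == MAGIC_NANO,
        'version': (major, minor),
        'snaplen': snaplen,
        'network': network,       # data link layer type, e.g. 1 for Ethernet
    }


header = parse_global_header(
    struct.pack('<IHHiIII', MAGIC_MICRO, 2, 4, 0, 0, 65535, 1))
```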

run()[source]

Start extraction.

We use import_test() to check whether a given engine is available. For supported engines, each engine has a different driver method:

Warns

EngineWarning – If the extraction engine is not available, either because a dependency is not installed, there are not enough CPUs, or the supplied engine is unknown.

property engine

PCAP extraction engine.

Return type

str

property format

Format of output file.

Raises

UnsupportedCall – If self._flag_q is set as True, as output is disabled by initialisation parameter.

Return type

str

property frame

Extracted frames.

Raises

UnsupportedCall – If self._flag_d is False, as storing frame data is disabled.

Return type

Tuple[Info[DataType_Frame]]

property header

Global header.

Raises

UnsupportedCall – If self._exeng is 'scapy' or 'pyshark', as these engines do not retain such information.

Return type

Info[DataType_Header]

property info

Version of input PCAP file.

Raises

UnsupportedCall – If self._exeng is 'scapy' or 'pyshark', as these engines do not retain such information.

Return type

VersionInfo

property input

Name of input PCAP file.

Return type

str

property length

Frame number (of current extracted frame or all).

Return type

int

property output

Name of output file.

Raises

UnsupportedCall – If self._flag_q is set as True, as output is disabled by initialisation parameter.

Return type

str

property protocol

Protocol chain of current frame.

Raises

UnsupportedCall – If self._flag_a is True, as such attribute is not applicable.

Return type

ProtoChain

property reassembly

Frame record for reassembly.

Return type

Info

property trace

Index table for traced flow.

Raises

UnsupportedCall – If self._flag_t is False, as TCP flow tracing is disabled.

Return type

Tuple[Info]

pcapkit.foundation.extraction.CPU_CNT: int

Number of available CPUs. The value is used as the maximum concurrent workers in multiprocessing engines.

pcapkit.foundation.extraction.LAYER_LIST = {'Application', 'Internet', 'Link', 'None', 'Transport'}

List of layers.

pcapkit.foundation.extraction.PROTO_LIST = {'ah', 'application', 'arp', 'drarp', 'ethernet', 'frame', 'ftp', 'header', 'hip', 'hopopt', 'http', 'httpv1', 'httpv2', 'inarp', 'internet', 'ip', 'ipsec', 'ipv4', 'ipv6', 'ipv6_frag', 'ipv6_opts', 'ipv6_route', 'ipx', 'l2tp', 'link', 'mh', 'null', 'ospf', 'protocol', 'rarp', 'raw', 'tcp', 'transport', 'udp', 'vlan'}

List of protocols.