Extractor for PCAP Files¶
pcapkit.foundation.extraction contains Extractor only, which synthesises file I/O and protocol analysis, coordinates information exchange in all network layers, and extracts parameters from a PCAP file.
-
class
pcapkit.foundation.extraction.Extractor(fin=None, fout=None, format=None, auto=True, extension=True, store=True, files=False, nofile=False, verbose=False, engine=None, layer=None, protocol=None, ip=False, ipv4=False, ipv6=False, tcp=False, strict=True, trace=False, trace_fout=None, trace_format=None, trace_byteorder='little', trace_nanosecond=False)[source]¶ Bases:
object
Extractor for PCAP files.
For supported engines, please refer to the corresponding driver method for more information:
Default drivers:
Global header: record_header()
Packet frames: record_frames()
DPKT driver: _run_dpkt()
Scapy driver: _run_scapy()
PyShark driver: _run_pyshark()
Multiprocessing driver:
Pipeline model: _run_pipeline()
Server model: _run_server()
-
_flag_v: Union[bool, Callable[[pcapkit.foundation.extraction.Extractor, pcapkit.protocols.pcap.frame.Frame]]]¶ A bool value or a function that takes the Extractor instance and the current parsed frame (depending on the engine selected) as parameters to print verbose output information (as the verbose parameter).
-
_vfunc: Union[NotImplemented, Callable[[pcapkit.foundation.extraction.Extractor, pcapkit.protocols.pcap.frame.Frame]]]¶ If the verbose parameter is a callable, then it will be assigned as self._vfunc; otherwise, it keeps NotImplemented as a placeholder and has a specific function for each engine.
-
_frame: List[pcapkit.protocols.pcap.frame.Frame]¶ Frame records storage.
-
_proto: pcapkit.corekit.protochain.ProtoChain¶ Current frame’s protocol chain.
-
_reasm: List[Optional[pcapkit.reassembly.ipv4.IPv4_Reassembly], Optional[pcapkit.reassembly.ipv6.IPv6_Reassembly], Optional[pcapkit.reassembly.tcp.TCP_Reassembly]]¶ Reassembly buffers.
-
_trace: Optional[pcapkit.foundation.traceflow.TraceFlow]¶ TCP flow tracer.
-
_ifile: io.BufferedReader¶ Source PCAP file (opened in binary mode).
-
_ofile: Optional[Union[dictdumper.dumper.Dumper, Type[dictdumper.dumper.Dumper]]]¶ Output dumper. If self._flag_f is True, it is the Dumper class object; otherwise it is an initialised Dumper instance.
Note
We customised the object_hook() method to provide generic support of enum.Enum, ipaddress.IPv4Address, ipaddress.IPv6Address and Info.
See also
When the output format is unsupported, we use NotImplementedIO as a fallback solution.
-
_gbhdr: pcapkit.protocols.pcap.header.Header¶ Parsed PCAP global header instance.
-
_vinfo: pcapkit.corekit.version.VersionInfo¶ The version info of the PCAP file (as the self._gbhdr.version property).
-
_dlink: pcapkit.const.reg.linktype.LinkType¶ Protocol type of data link layer (as the self._gbhdr.protocol property).
-
_nnsec: bool¶ Nanosecond PCAP file flag (as the self._gbhdr.nanosecond property).
-
_type: str¶ Output format (as the self._ofile.kind property).
-
_expkg: types.ModuleType¶ Extraction engine module.
-
_extmp: Iterator[Any]¶ Temporary storage for frame parsing iterator.
-
_mpprc: List[multiprocessing.Process]¶ List of active child processes.
-
_mpfdp: DefaultDict[multiprocessing.Queue]¶ File pointer (offset) queue for each frame.
-
_mpmng: multiprocessing.sharedctypes.multiprocessing.Manager¶ Multiprocessing manager context.
-
_mpkit: multiprocessing.managers.SyncManager.Namespace¶ Multiprocessing utility namespace.
-
_mpkit.frames: Dict[int, pcapkit.protocols.pcap.frame.Frame]¶ Frame storage.
-
_mpkit.trace: Optional[pcapkit.foundation.traceflow.TraceFlow]¶ TCP flow tracer.
-
_mpkit.reassembly: List[Optional[pcapkit.reassembly.ipv4.IPv4_Reassembly], Optional[pcapkit.reassembly.ipv6.IPv6_Reassembly], Optional[pcapkit.reassembly.tcp.TCP_Reassembly]]¶ Reassembly buffers.
-
_mpsrv: multiprocessing.Process¶ Server process for frame analysis and processing.
-
_mpbuf: Union[multiprocessing.managers.SyncManager.dict, Dict[int, pcapkit.protocols.pcap.frame.Frame]]¶ Multiprocessing buffer for parsed PCAP frames.
-
_mpfrm: Union[multiprocessing.managers.SyncManager.list, List[pcapkit.protocols.pcap.frame.Frame]]¶ Multiprocessing storage for processed PCAP frames.
-
_mprsm: Union[multiprocessing.managers.SyncManager.list, List[Optional[pcapkit.reassembly.ipv4.IPv4_Reassembly], Optional[pcapkit.reassembly.ipv6.IPv6_Reassembly], Optional[pcapkit.reassembly.tcp.TCP_Reassembly]]]¶ Multiprocessing storage for reassembly buffers.
-
__call__()[source]¶ Works as a simple wrapper for the iteration protocol.
- Raises
IterableError – If self._flag_a is True, as iteration is not applicable.
-
__init__(fin=None, fout=None, format=None, auto=True, extension=True, store=True, files=False, nofile=False, verbose=False, engine=None, layer=None, protocol=None, ip=False, ipv4=False, ipv6=False, tcp=False, strict=True, trace=False, trace_fout=None, trace_format=None, trace_byteorder='little', trace_nanosecond=False)[source]¶ Initialise PCAP Reader.
- Parameters
fin (Optional[str]) – file name to be read; if the file does not exist, raise FileNotFound
fout (Optional[str]) – file name to be written
format (Optional[Literal['plist', 'json', 'tree']]) – file format of output
auto (bool) – if automatically run till EOF
extension (bool) – if check and append extensions to output file
store (bool) – if store extracted packet info
files (bool) – if split each frame into different files
nofile (bool) – if no output file is to be dumped
verbose (Union[bool, Callable[[pcapkit.foundation.extraction.Extractor, pcapkit.protocols.pcap.frame.Frame]]]) – a bool value or a function that takes the Extractor instance and the current parsed frame (depending on the engine selected) as parameters to print verbose output information
engine (Optional[Literal['default', 'pcapkit', 'dpkt', 'scapy', 'pyshark', 'server', 'pipeline']]) – extraction engine to be used
layer (Optional[Literal['Link', 'Internet', 'Transport', 'Application']]) – extract until which layer
protocol (Optional[Union[str, Tuple[str], Type[Protocol]]]) – extract until which protocol
ip (bool) – if record data for IPv4 & IPv6 reassembly
ipv4 (bool) – if perform IPv4 reassembly
ipv6 (bool) – if perform IPv6 reassembly
tcp (bool) – if perform TCP reassembly
strict (bool) – if set strict flag for reassembly
trace (bool) – if trace TCP traffic flows
trace_fout (Optional[str]) – path name for flow tracer if necessary
trace_format (Optional[Literal['plist', 'json', 'tree', 'pcap']]) – output file format of flow tracer
trace_byteorder (Literal['little', 'big']) – output file byte order
trace_nanosecond (bool) – output nanosecond-resolution file flag
- Warns
FormatWarning – Warns under following circumstances:
If using PCAP output for TCP flow tracing while the extraction engine is PyShark.
If output file format is not supported.
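As a minimal usage sketch of the parameters above, the helper below builds an Extractor that writes no output file and keeps the parsed frames in memory. The function name extract_frames is hypothetical; the keyword arguments and the frame property are the ones documented on this page, and the import is deferred so the sketch can be loaded without PyPCAPKit installed.

```python
def extract_frames(path):
    """Return the frames extracted from ``path`` without writing an output file.

    ``extract_frames`` is a hypothetical helper name; the keyword arguments
    are the ones listed in the ``Extractor`` signature above.
    """
    # Imported lazily so this sketch loads even when PyPCAPKit is missing.
    from pcapkit.foundation.extraction import Extractor

    extractor = Extractor(
        fin=path,      # input PCAP file name
        nofile=True,   # no output file is to be dumped
        store=True,    # store extracted packet info
        auto=True,     # automatically run till EOF
    )
    # ``frame`` is the documented property holding the extracted frames.
    return extractor.frame
```

Calling extract_frames('in.pcap') would then be expected to return a tuple of frame records, provided the file exists.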
-
__iter__()[source]¶ Iterate and parse PCAP frame.
- Raises
IterableError – If self._flag_a is True, as such operation is not applicable.
-
__next__()[source]¶ Iterate and parse next PCAP frame.
It will call _read_frame() to parse the next PCAP frame internally until EOF is reached; then it calls _cleanup() for the aftermath.
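The iteration behaviour described above can be sketched with a toy class. This is not PyPCAPKit code: FrameIterator and its in-memory frame source are assumptions made for illustration, and only the pattern (read a frame, clean up on EOF, stop iterating) mirrors the description.

```python
class FrameIterator:
    """Toy stand-in for the iteration protocol; not PyPCAPKit code."""

    def __init__(self, frames):
        self._source = iter(frames)   # hypothetical in-memory frame source
        self.closed = False           # set once cleanup has run

    def _read_frame(self):
        # Signal end of input with EOFError, as a file-based reader would.
        try:
            return next(self._source)
        except StopIteration:
            raise EOFError from None

    def _cleanup(self):
        self.closed = True

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return self._read_frame()
        except EOFError:
            # EOF reached: run the aftermath, then end the iteration.
            self._cleanup()
            raise StopIteration from None


reader = FrameIterator([b"frame-1", b"frame-2"])
frames = list(reader)
```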
-
_aftermathmp()[source]¶ Aftermath for multiprocessing.
The method will join all child processes forked/spawned as in self._mpprc, and will join the self._mpsrv server process if using the multiprocessing server engine.
For the multiprocessing server engine, it will
assign self._mpfrm to self._frame
assign self._mprsm to self._reasm
copy self._mpkit.trace to self._trace
For the multiprocessing pipeline engine, it will
restore self._frame from self._mpkit.frames
copy self._mpkit.reassembly to self._reasm
copy self._mpkit.trace to self._trace
After restoring attributes, it will shut down the multiprocessing manager context self._mpmng, delete all multiprocessing attributes (i.e. those starting with _mp), and deduct 2 from the frame number self._frnum (hacking solution).
Notes
If self._flag_e is already set as True, do nothing.
- Raises
UnsupportedCall – If self._flag_m is False, as such operation is not applicable.
-
_cleanup()[source]¶ Cleanup after extraction & analysis.
The method clears the self._expkg and self._extmp attributes, sets self._flag_e as True and closes the input file.
-
_default_read_frame(*, frame=None, mpkit=None)[source]¶ Read frames with default engine.
This method performs the following operations:
extract frames and each layer of packets;
make Info object out of frame properties;
write to output file with corresponding dumper;
reassemble IP and/or TCP datagram;
trace TCP flows if any;
record frame Info object to frame storage.
- Keyword Arguments
frame (Optional[pcapkit.protocols.pcap.frame.Frame]) – The fallback frame data (for multiprocessing engines).
mpkit (multiprocessing.managers.SyncManager.Namespace) – The multiprocess data kit.
- Returns
Parsed frame instance.
- Return type
Optional[pcapkit.protocols.pcap.frame.Frame]
-
_dpkt_read_frame()[source]¶ Read frames with DPKT engine.
- Returns
Parsed frame instance.
- Return type
See also
Please refer to _default_read_frame() for more operational information.
-
_pipeline_read_frame(*, mpfdp, mpkit)[source]¶ Extract frame with multiprocessing pipeline engine.
The method calls Frame to parse the PCAP frame data. Should EOFError be raised, it will toggle self._mpkit.eof as True. Finally, it will decrement self.mpkit.counter by 1 and close the input source file (as the child process exits).
For the parsed Frame instance, the method will first wait until self.mpkit.current is the same as self._frnum, i.e. it is now time to process the parsed frame in linear sequential order.
It will proceed by calling _default_read_frame(), whilst temporarily assigning self.mpkit.trace to self._trace and self.mpkit.reassembly to self._reasm, then putting them back.
- Keyword Arguments
mpfdp (multiprocessing.Queue) – Queue for multiprocessing file pointer (offset).
mpkit (multiprocessing.managers.SyncManager.Namespace) – Namespace instance as self._mpkit.
- Raises
EOFError – If self._flag_e is True, as the parsing had finished.
-
_pyshark_read_frame()[source]¶ Read frames with PyShark engine.
- Returns
Parsed frame instance.
- Return type
pyshark.packet.packet.Packet
Notes
This method inserts packet2dict() to the parsed frame instance as packet2dict() method.
See also
Please refer to _default_read_frame() for more operational information.
-
_read_frame()[source]¶ Headquarters for frame reader.
This method is a dispatcher for parsing frames.
For Scapy engine, calls _scapy_read_frame().
For DPKT engine, calls _dpkt_read_frame().
For PyShark engine, calls _pyshark_read_frame().
For default (PyPCAPKit) engine, calls _default_read_frame().
- Returns
The parsed frame instance.
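A dispatcher of this kind can be sketched as a lookup from engine name to reader method. The Dispatcher class, its engine attribute and the string return values are assumptions for illustration; only the reader method names come from this page.

```python
class Dispatcher:
    """Toy dispatcher keyed on an engine name; not PyPCAPKit code."""

    def __init__(self, engine="default"):
        self.engine = engine   # hypothetical attribute holding the engine name

    # Each reader returns a label so the dispatch result is easy to inspect.
    def _scapy_read_frame(self):
        return "scapy"

    def _dpkt_read_frame(self):
        return "dpkt"

    def _pyshark_read_frame(self):
        return "pyshark"

    def _default_read_frame(self):
        return "default"

    def _read_frame(self):
        readers = {
            "scapy": self._scapy_read_frame,
            "dpkt": self._dpkt_read_frame,
            "pyshark": self._pyshark_read_frame,
        }
        # Any other engine name falls back to the default reader.
        return readers.get(self.engine, self._default_read_frame)()
```

A table lookup keeps the fallback to the default engine in one place instead of spreading it over a chain of conditionals.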
-
_run_dpkt(dpkt)[source]¶ Call dpkt.pcap.Reader to extract PCAP files.
This method assigns self._expkg as dpkt and self._extmp as an iterator from dpkt.pcap.Reader.
- Parameters
dpkt (types.ModuleType) – The dpkt module.
- Warns
AttributeWarning – If self._exlyr and/or self._exptl is provided, as the DPKT engine currently does not support such operations.
-
_run_pipeline(multiprocessing)[source]¶ Use pipeline multiprocessing to extract PCAP files.
Notes
The basic concept of the multiprocessing pipeline engine is that we parse the PCAP file as a pipeline, with one frame per worker. Once the length of a frame is known, i.e. the PCAP frame header is parsed, we can start a new worker and begin parsing the next frame concurrently.
However, as the datagram reassembly and TCP flow tracing require linear sequential processing, we still need to wait for the completion of analysis on previous frames before proceeding with such operations.
This method assigns self._expkg as multiprocessing, creates a file pointer storage as self._mpfdp, manager context as self._mpmng and namespace as self._mpkit.
In the namespace, we initiate the number of (on duty) workers as counter, pool of (ready) workers as pool, current frame number as current, EOF flag as eof, frame storage as frames, TCP flow tracer self._trace as trace and the reassembly buffers self._reasm as reassembly.
After initial setup, the method calls record_header() to parse the PCAP global header and puts the file offset to self._mpfdp as the start of the first frame. Then it starts the parsing of each PCAP frame.
During this phase, it runs a while clause until self._mpkit.eof is set as True, then it calls _update_eof() and breaks. In the while clause, it maintains a multiprocessing.pool.Pool-like worker pool. It checks self._mpkit.pool for available workers and self._mpkit.counter for active workers.
When starting a new worker, it first updates the input file offset to the file offset as specified in self._mpfdp. Then it creates a child process running _pipeline_read_frame() with keyword arguments mpkit as self._mpkit and mpfdp as the corresponding Queue from self._mpfdp. Later, it decrements self._mpkit.pool and increments self._mpkit.counter, both by 1. The child process will be appended to self._mpprc.
When the number of active workers is greater than or equal to CPU_CNT, it waits for and joins the leading child processes in self._mpprc, then removes their references.
- Parameters
multiprocessing (types.ModuleType) – The multiprocessing module.
- Warns
AttributeWarning – If self._flag_q is False, as multiprocessing engines do not support output.
- Raises
UnsupportedCall – If self._flag_m is False, as such operation is not applicable.
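The core idea of the pipeline model (parse concurrently, analyse strictly in frame order) can be sketched as follows. To keep the example short and portable it uses threads and a threading.Condition rather than child processes and a manager namespace, so it illustrates the ordering constraint only and is not the library's implementation; run_pipeline and the upper-casing "parse" step are made up for illustration.

```python
import threading


def run_pipeline(raw_frames):
    """Parse frames concurrently, then analyse them strictly in order."""
    turn = threading.Condition()
    state = {"current": 1}   # number of the frame allowed to be analysed
    analysed = []

    def worker(number, raw):
        parsed = raw.upper()   # stand-in for concurrent frame parsing
        with turn:
            # Wait until every earlier frame has been analysed.
            turn.wait_for(lambda: state["current"] == number)
            analysed.append((number, parsed))   # sequential analysis step
            state["current"] += 1
            turn.notify_all()

    # Start workers in reverse order to show that ordering still holds.
    numbered = list(enumerate(raw_frames, start=1))
    workers = [
        threading.Thread(target=worker, args=(number, raw))
        for number, raw in reversed(numbered)
    ]
    for thread in workers:
        thread.start()
    for thread in workers:
        thread.join()
    return analysed


result = run_pipeline(["syn", "ack", "fin"])
```

Here state["current"] plays the role that the current counter in the namespace plays in the description above.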
-
_run_pyshark(pyshark)[source]¶ Call pyshark.FileCapture to extract PCAP files.
This method assigns self._expkg as pyshark and self._extmp as an iterator from pyshark.FileCapture.
- Parameters
pyshark (types.ModuleType) – The pyshark module.
- Warns
AttributeWarning – Warns under the following circumstances:
if self._exlyr and/or self._exptl is provided, as the PyShark engine currently does not support such operations.
if reassembly is enabled, as the PyShark engine currently does not support such operation.
-
_run_scapy(scapy_all)[source]¶ Call scapy.all.sniff() to extract PCAP files.
This method assigns self._expkg as scapy.all and self._extmp as an iterator from scapy.all.sniff().
- Parameters
scapy_all (types.ModuleType) – The scapy.all module.
- Warns
AttributeWarning – If self._exlyr and/or self._exptl is provided, as the Scapy engine currently does not support such operations.
-
_run_server(multiprocessing)[source]¶ Use server multiprocessing to extract PCAP files.
Notes
The basic concept of the multiprocessing server engine is that we further separate the logic of PCAP frame parsing and analysis/processing, compared to the multiprocessing pipeline engine (c.f. _run_pipeline()).
We start a server process to perform the datagram reassembly and TCP flow tracing, etc. of all parsed PCAP frames, whilst parsing each PCAP frame in the same manner as in the multiprocessing pipeline engine, i.e. one frame per worker.
This method assigns self._expkg as multiprocessing, creates a file pointer storage as self._mpfdp, manager context as self._mpmng and namespace as self._mpkit. We will also maintain the active process list self._mpprc as in _run_pipeline().
It will also create a dict as self._mpbuf, frame buffer (temporary storage) for the server process to obtain the parsed frames; a list as self._mpfrm, eventual frame storage; and another list as self._mprsm, storing the reassembly buffers self._reasm before the server process exits.
In the namespace, we initiate the number of (on duty) workers as counter, pool of (ready) workers as pool, current frame number as current, EOF flag as eof, frame storage as frames, and trace for storing TCP flow tracer self._trace before the server process exits.
After initial setup, the method calls record_header() to parse the PCAP global header and puts the file offset to self._mpfdp as the start of the first frame. It will then start the server process self._mpsrv from _server_analyse_frame(). Finally, it starts the parsing of each PCAP frame.
During this phase, it runs a while clause until self._mpkit.eof is set as True, then it calls _update_eof() and breaks. In the while clause, it maintains a multiprocessing.pool.Pool-like worker pool. It checks self._mpkit.pool for available workers and self._mpkit.counter for active workers.
When starting a new worker, it first updates the input file offset to the file offset as specified in self._mpfdp. Then it creates a child process running _server_extract_frame() with keyword arguments mpkit as self._mpkit, mpbuf as self._mpbuf and mpfdp as the corresponding Queue from self._mpfdp. Later, it decrements self._mpkit.pool and increments self._mpkit.counter, both by 1. The child process will be appended to self._mpprc.
When the number of active workers is greater than or equal to CPU_CNT, it waits for and joins the leading child processes in self._mpprc, then removes their references.
- Parameters
multiprocessing (types.ModuleType) – The multiprocessing module.
- Warns
AttributeWarning – If self._flag_q is False, as multiprocessing engines do not support output.
- Raises
UnsupportedCall – If self._flag_m is False, as such operation is not applicable.
-
_scapy_read_frame()[source]¶ Read frames with Scapy engine.
- Returns
Parsed frame instance.
- Return type
See also
Please refer to _default_read_frame() for more operational information.
-
_server_analyse_frame(*, mpkit, mpfrm, mprsm, mpbuf)[source]¶ Analyse frame using multiprocessing server engine.
This method starts a while clause. For each round, it will pop the frame self._frnum from mpbuf, then call _default_read_frame() to perform datagram reassembly and TCP flow tracing, etc.
Once the frame popped is EOFError, i.e. the frame parsing had finished, it breaks from the clause and updates mpfrm with self._frame, mprsm with self._reasm, and mpkit.trace with self._trace.
- Keyword Arguments
mpkit (multiprocessing.managers.SyncManager.Namespace) – Namespace instance as _mpkit.
mpfrm (multiprocessing.managers.SyncManager.list) – Frame storage.
mprsm (multiprocessing.managers.SyncManager.list) – Reassembly buffers.
mpbuf (multiprocessing.managers.SyncManager.dict) – Frame buffer (temporary storage) for the server process self._mpsrv to obtain the parsed frames.
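The server loop described above can be sketched as a consumer that pops frames from a shared buffer in frame-number order until it meets an EOFError sentinel. This sketch uses a thread and a plain dict in place of a server process and a manager dict, and it assumes the sentinel is stored as an EOFError instance; the polling interval and the analyse function name are likewise illustrative.

```python
import threading
import time


def analyse(mpbuf, results):
    """Consume frames from ``mpbuf`` in frame-number order until EOF."""
    frnum = 1
    while True:
        if frnum not in mpbuf:
            time.sleep(0.001)   # frame not parsed yet; poll again
            continue
        frame = mpbuf.pop(frnum)
        if isinstance(frame, EOFError):
            break               # sentinel: frame parsing has finished
        results.append(frame)   # stand-in for reassembly / flow tracing
        frnum += 1


mpbuf, results = {}, []
server = threading.Thread(target=analyse, args=(mpbuf, results))
server.start()
# Workers may deliver frames out of order; the sentinel takes the next number.
mpbuf[2] = "frame-2"
mpbuf[1] = "frame-1"
mpbuf[3] = EOFError()
server.join()
```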
-
_server_extract_frame(*, mpfdp, mpkit, mpbuf)[source]¶ Extract frame using multiprocessing server engine.
The method calls Frame to parse the PCAP frame data. The parsed frame will be saved to mpbuf under the corresponding frame number self._frnum.
Should EOFError be raised, it will toggle self._mpkit.eof as True, and save the EOFError object to mpbuf under the corresponding frame number self._frnum.
Finally, it will decrement self.mpkit.counter by 1 and close the input source file (as the child process exits).
- Parameters
mpfdp (multiprocessing.Queue) – Queue for multiprocessing file pointer (offset).
mpkit (multiprocessing.managers.SyncManager.Namespace) – Namespace instance as _mpkit.
mpbuf (multiprocessing.managers.SyncManager.dict) – Frame buffer (temporary storage) for the server process self._mpsrv to obtain the parsed frames.
- Raises
EOFError – If self._flag_e is True, as the parsing had finished.
-
_update_eof()[source]¶ Update EOF flag.
This method calls _aftermathmp() to clean up multiprocessing resources, closes the input file and toggles self._flag_e as True.
-
check()[source]¶ Check layer and protocol thresholds.
- Warns
LayerWarning – If self._exlyr is not recognised.
ProtocolWarning – If self._exptl is not recognised.
See also
List of available layers: LAYER_LIST
List of available protocols: PROTO_LIST
-
static
import_test(engine, *, name=None)[source]¶ Test import for extraction engine.
- Parameters
engine (str) – Extraction engine module name.
- Keyword Arguments
name (Optional[str]) – Extraction engine display name.
- Warns
EngineWarning – If the engine module is not installed.
- Returns
If succeeded, returns True and the module; otherwise, returns False and None.
- Return type
Tuple[bool, Optional[ModuleType]]
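The import check can be approximated with importlib. The sketch below follows the documented contract (a flag plus the module or None, with a warning on failure), but it is an assumption-laden stand-in: it issues a generic RuntimeWarning where the library issues EngineWarning, and the message text is invented.

```python
import importlib
import warnings


def import_test(engine, *, name=None):
    """Try to import ``engine``; return ``(True, module)`` or ``(False, None)``."""
    try:
        module = importlib.import_module(engine)
    except ImportError:
        # RuntimeWarning stands in for the library's EngineWarning here.
        warnings.warn(
            f"extraction engine {name or engine!r} not available",
            RuntimeWarning,
            stacklevel=2,
        )
        return False, None
    return True, module
```

For instance, import_test('dpkt', name='DPKT') would report whether the optional dpkt dependency can be imported in the current environment.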
-
classmethod
make_name(fin, fout, fmt, extension, *, files=False, nofile=False)[source]¶ Generate input and output filenames.
The method will perform the following processing:
sanitise fin as the input PCAP filename; use in.pcap as the default value and append the .pcap extension if needed and extension is True; as well as test if the file exists;
if nofile is True, skip the following processing;
if fmt is provided, it presumes the corresponding output file extension;
if fout is not provided, it presumes the output file name based on the presumptive file extension; the stem of the output file name is set as out; should the file extension be unavailable, it raises FormatError;
if fout is provided, it presumes the corresponding output format if needed; should the presumption fail, it raises FormatError;
it will also append the corresponding file extension to the output file name if needed and extension is True.
- Parameters
- Keyword Arguments
- Returns
Generated input and output filenames:
input filename
output filename / directory name
output format
output file extension (without .)
if split each frame into different files
- Return type
- Raises
FileNotFound – If input file does not exist.
FormatError – If output format is not provided and cannot be presumed.
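The naming rules above can be sketched as a plain function. This is a simplified illustration, not the library's classmethod: the format-to-extension table is an assumption, the file-existence test and the files flag are omitted, ValueError stands in for FormatError, and the return shape for nofile is made up.

```python
import os

# Assumed mapping from output format to file extension (illustrative only).
EXTENSIONS = {"plist": "plist", "json": "json", "tree": "txt"}


def make_name(fin=None, fout=None, fmt=None, extension=True, *, nofile=False):
    """Derive (input name, output name, format, extension) from the arguments."""
    ifnm = fin or "in.pcap"
    if extension and not ifnm.endswith(".pcap"):
        ifnm += ".pcap"
    if nofile:
        return ifnm, None, None, None   # no output naming needed

    ext = EXTENSIONS.get(fmt)
    if fout is None:
        if ext is None:
            # FormatError in the library; ValueError in this sketch.
            raise ValueError("output format not provided")
        ofnm = "out"
    else:
        ofnm = fout
        if ext is None:
            # Presume the format from the output file extension.
            suffix = os.path.splitext(fout)[1].lstrip(".")
            by_ext = {value: key for key, value in EXTENSIONS.items()}
            if suffix not in by_ext:
                raise ValueError("output format cannot be presumed")
            fmt, ext = by_ext[suffix], suffix
    if extension and not ofnm.endswith("." + ext):
        ofnm += "." + ext
    return ifnm, ofnm, fmt, ext
```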
-
record_frames()[source]¶ Read packet frames.
The method calls _read_frame() to parse each frame from the input PCAP file, and calls _cleanup() upon completion.
Notes
Under non-auto mode, i.e. self._flag_a is False, the method performs no action.
-
record_header()[source]¶ Read global header.
The method will parse the PCAP global header and save the parsed result as self._gbhdr. Information such as PCAP version, data link layer protocol type, nanosecond flag and byteorder will also be saved to the current Extractor instance.
If TCP flow tracing is enabled, the nanosecond flag and byteorder will be used for the output PCAP file of the traced TCP flows.
For output, the method will dump the parsed PCAP global header under the name of Global Header.
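For orientation, the fields mentioned above (version, link type, nanosecond flag, byte order) all live in the 24-byte PCAP global header. The standalone sketch below reads them with the standard struct module; parse_global_header and its dict layout are illustrative and do not reflect how the library's Header class is implemented.

```python
import struct

MAGIC_MICRO = 0xA1B2C3D4   # microsecond-resolution PCAP
MAGIC_NANO = 0xA1B23C4D    # nanosecond-resolution PCAP


def parse_global_header(data):
    """Parse the 24-byte PCAP global header into a plain dict."""
    if len(data) < 24:
        raise EOFError("truncated PCAP global header")
    # The magic number reveals both byte order and timestamp resolution.
    for byteorder, prefix in (("little", "<"), ("big", ">")):
        magic = struct.unpack(prefix + "I", data[:4])[0]
        if magic in (MAGIC_MICRO, MAGIC_NANO):
            break
    else:
        raise ValueError("not a PCAP file")
    major, minor, _zone, _sigfigs, snaplen, network = struct.unpack(
        prefix + "HHiIII", data[4:24]
    )
    return {
        "byteorder": byteorder,
        "nanosecond": magic == MAGIC_NANO,
        "version": (major, minor),
        "snaplen": snaplen,
        "network": network,   # link-layer type, e.g. 1 for Ethernet
    }


# Synthetic little-endian header: version 2.4, snaplen 65535, Ethernet.
sample = struct.pack("<IHHiIII", MAGIC_MICRO, 2, 4, 0, 0, 65535, 1)
header = parse_global_header(sample)
```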
-
run()[source]¶ Start extraction.
We use import_test() to check if a certain engine is available or not. For supported engines, each engine has a different driver method:
Default drivers:
Global header: record_header()
Packet frames: record_frames()
DPKT driver: _run_dpkt()
Scapy driver: _run_scapy()
PyShark driver: _run_pyshark()
Multiprocessing driver:
Pipeline model: _run_pipeline()
Server model: _run_server()
- Warns
EngineWarning – If the extraction engine is not available. This is because a dependency is not installed, the number of CPUs is insufficient, or the supplied engine is unknown.
-
property
format¶ Format of output file.
- Raises
UnsupportedCall – If self._flag_q is set as True, as output is disabled by initialisation parameter.
- Return type
-
property
frame¶ Extracted frames.
- Raises
UnsupportedCall – If self._flag_d is False, as storing frame data is disabled.
- Return type
Tuple[Info[DataType_Frame]]
-
property
header¶ Global header.
- Raises
UnsupportedCall – If self._exeng is 'scapy' or 'pyshark', as such engines do not retain such information.
- Return type
-
property
info¶ Version of input PCAP file.
- Raises
UnsupportedCall – If self._exeng is 'scapy' or 'pyshark', as such engines do not retain such information.
- Return type
-
property
output¶ Name of output file.
- Raises
UnsupportedCall – If self._flag_q is set as True, as output is disabled by initialisation parameter.
- Return type
-
property
protocol¶ Protocol chain of current frame.
- Raises
UnsupportedCall – If self._flag_a is True, as such attribute is not applicable.
- Return type
-
property
reassembly¶ Frame record for reassembly.
ipv4 – tuple of IPv4 payload fragment (IPv4_Reassembly)
ipv6 – tuple of IPv6 payload fragment (IPv6_Reassembly)
tcp – tuple of TCP payload fragment (TCP_Reassembly)
- Return type
-
property
trace¶ Index table for traced flow.
- Raises
UnsupportedCall – If self._flag_t is False, as TCP flow tracing is disabled.
- Return type
Tuple[Info]
-
pcapkit.foundation.extraction.CPU_CNT: int¶ Number of available CPUs. The value is used as the maximum concurrent workers in multiprocessing engines.
-
pcapkit.foundation.extraction.LAYER_LIST= {'Application', 'Internet', 'Link', 'None', 'Transport'}¶ List of layers.
-
pcapkit.foundation.extraction.PROTO_LIST= {'ah', 'application', 'arp', 'drarp', 'ethernet', 'frame', 'ftp', 'header', 'hip', 'hopopt', 'http', 'httpv1', 'httpv2', 'inarp', 'internet', 'ip', 'ipsec', 'ipv4', 'ipv6', 'ipv6_frag', 'ipv6_opts', 'ipv6_route', 'ipx', 'l2tp', 'link', 'mh', 'null', 'ospf', 'protocol', 'rarp', 'raw', 'tcp', 'transport', 'udp', 'vlan'}¶ List of protocols.