Extracting Indicators of Compromise from Threat Reports Using MSTICPy

Many threat intelligence reports are published every week, and they contain a lot of content that can help improve your defenses. Indicators of Compromise (IOCs) are among the most actionable data to extract from a threat report. Although IOCs are usually listed at the end of a blog post or in a companion document, copying and pasting them by hand can be tedious. Several IOC extractors are already available on GitHub. This notebook is a small proof of concept (POC) that shows how to use the IoCExtract class from the MSTICPy library to retrieve IOCs from a URL. The code can be reused to extract IOCs from any text source.

Limitation

This is an early-stage proof of concept, so some IOCs extracted from a URL may be false positives: the extractor does not differentiate between a malicious URL and a legitimate one. To mitigate this, I added a whitelist of regular expressions (loaded from the whitelists/whitelist_*.txt files) that is used to remove known-good values. Depending on the URL, you may still have more to filter out.
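
The whitelist filtering described above can be sketched on its own, without the notebook widgets. This is a minimal illustration, assuming the extracted IOCs are a dictionary that maps an IOC type to a set of strings; the function name filter_iocs and the sample values are hypothetical and are not part of MSTICPy.

```python
import re
from typing import Dict, Iterable, Set


def filter_iocs(iocs: Dict[str, Set[str]], patterns: Iterable[str]) -> Dict[str, Set[str]]:
    """Return a copy of `iocs` without the values matching any whitelist pattern."""
    # Skip blank patterns: an empty regex matches every string
    compiled = [re.compile(p) for p in patterns if p.strip()]
    return {
        ioc_type: {v for v in values if not any(rx.search(v) for rx in compiled)}
        for ioc_type, values in iocs.items()
    }


# Hypothetical sample data (documentation-reserved domain and IP range)
found = {
    "dns": {"github.com", "evil-domain.example"},
    "ipv4": {"8.8.8.8", "203.0.113.7"},
}
allow = [r"(^|\.)github\.com$", r"^8\.8\.8\.8$", ""]

cleaned = filter_iocs(found, allow)
print(cleaned)
```

Anchoring each whitelist pattern (with ^ and $) keeps it from matching inside a longer, unrelated value.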

Planned Improvements

  • Improve the extraction
  • Reduce false positives
  • Extract from multiple sources (PDF, text...)
  • Add additional regexes
  • Add multiple export formats
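
As a rough idea of what the export item could look like, here is a small sketch that turns a dictionary of IOC sets into CSV or JSON text using only the standard library. The helper names to_csv and to_json, the column names, and the sample values are assumptions made for illustration; none of this is implemented in the notebook.

```python
import csv
import io
import json
from typing import Dict, Set


def to_csv(iocs: Dict[str, Set[str]]) -> str:
    """Serialize IOCs as CSV text with one (type, value) row per indicator."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["type", "value"])
    # Sort the rows so the output is stable between runs
    writer.writerows(sorted((t, v) for t, values in iocs.items() for v in values))
    return buf.getvalue()


def to_json(iocs: Dict[str, Set[str]]) -> str:
    """Serialize IOCs as JSON text; sets are converted to sorted lists first."""
    return json.dumps({t: sorted(v) for t, v in iocs.items()}, indent=4, sort_keys=True)


# Hypothetical sample data
sample = {"ipv4": {"203.0.113.7"}, "dns": {"b.example", "a.example"}}
print(to_csv(sample))
print(to_json(sample))
```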

Code

In [216]:
# Imports and configuration
import os
import glob
import requests
import json
import ipywidgets as widgets
import pandas as pd
import re
from ipywidgets import Button, Layout, Checkbox
from IPython.display import display, HTML
from bs4 import BeautifulSoup
# Note: newer MSTICPy releases (2.x) expose this class as msticpy.transform.IoCExtract
from msticpy.sectools import IoCExtract
In [217]:
# Loading Whitelists
searchdir = "whitelists/whitelist_*.txt"
fpaths = glob.glob(searchdir)

patterns = []

# Compile the whitelists into a single list of regex patterns
for fpath in fpaths:
    with open(fpath) as f:
        # Skip blank lines: an empty pattern would match every IOC
        patterns += [line.strip() for line in f if line.strip()]
In [218]:
# Initiate the IOC extractor
ioc_extractor = IoCExtract()

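# Add a Bitcoin address regex (word boundaries instead of ^...$ so that
# addresses embedded in a sentence are matched too)
ioc_extractor.add_ioc_type(ioc_type='btc', ioc_regex=r'\b(?:[13][a-km-zA-HJ-NP-Z1-9]{26,33}|bc1[a-z0-9]{39,59})\b')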

# Configure widget
keyword = widgets.Text(
    value = "",
    placeholder = 'Enter the URL',
    description = 'Extract IOCs:',
    layout = Layout(width='90%', height='40px'),
    disabled = False
)
display(keyword)

# Configure checkboxes
checkbox_json = widgets.Checkbox(value = False, description="JSON")
display(checkbox_json)

checkbox_table = widgets.Checkbox(value = False, description="Table")
display(checkbox_table)

# Configure click button
button = widgets.Button(description = "Extract IOCs", layout = Layout(width='20%', height='40px', flex='3 1 0%'), icon = 'check', button_style='primary')
output = widgets.Output()

# Box layout
box_layout = widgets.Layout(display = 'flex', flex_flow='column', align_items='center', width='100%')
box = widgets.HBox(children = [button], layout = box_layout)
display(box)

# Searching for the input url
@output.capture()
def userInput(b):
    try:
        # Request to the url
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
        # Use a timeout so an unresponsive site cannot block the notebook
        result = requests.get(keyword.value, headers=headers, timeout=30)
        soup = BeautifulSoup(result.text, 'html.parser')
        
        print("[+] Extracting IOCs from: " + keyword.value)
        iocs_found = ioc_extractor.extract(str(soup.get_text()))

        if iocs_found:
            # Remove every element that matches a whitelist pattern
            compiled = [re.compile(w) for w in patterns]
            for k in iocs_found:
                for i in iocs_found[k].copy():
                    if any(w.search(i) for w in compiled):
                        iocs_found[k].discard(i)

            display(HTML('<h4> \nPotential IoCs found: </h4>'))
            
            # Get JSON result (sets are not JSON serializable, so convert them to sorted lists)
            if checkbox_json.value:
                ioc = {k: sorted(v) for k, v in iocs_found.items()}
                jsonioc = json.dumps(ioc, indent=4, sort_keys=True)
                print(jsonioc)
                
            # Get table result
            if checkbox_table.value:
                # Build all rows first; DataFrame.append was removed in pandas 2.0
                rows = [(k, i) for k, v in iocs_found.items() for i in v]
                ioctable = pd.DataFrame(rows, columns=['type', 'value'])

                display(ioctable)
    
        else:
            print("No IOCs found!")
        
    except requests.exceptions.RequestException as e:
        print(e)
    except (AttributeError, KeyError) as er:
        print(er)
    
# get the input url
button.on_click(userInput)
display(output)
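
A custom pattern such as the Bitcoin one can be checked with the standard re module before it is registered with the extractor. The sketch below compares a fully anchored pattern with a word-boundary variant; the names ANCHORED and BOUNDED are hypothetical, and the address is the well-known Bitcoin genesis block address, used here only as test data.

```python
import re

# Anchored: the address must be the entire string
ANCHORED = re.compile(r'^(?:[13][a-km-zA-HJ-NP-Z1-9]{26,33}|bc1[a-z0-9]{39,59})$')
# Word boundaries: the address may appear anywhere in the text
BOUNDED = re.compile(r'\b(?:[13][a-km-zA-HJ-NP-Z1-9]{26,33}|bc1[a-z0-9]{39,59})\b')

address = "1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"
sentence = "Example text mentioning " + address + " in a sentence."

anchored_alone = ANCHORED.findall(address)
anchored_in_text = ANCHORED.findall(sentence)
bounded_in_text = BOUNDED.findall(sentence)

print("anchored, address alone:", anchored_alone)
print("anchored, in a sentence:", anchored_in_text)
print("bounded, in a sentence: ", bounded_in_text)
```

Because report text rarely places an address on a line of its own, the word-boundary form is usually the better fit for scraped pages, at the cost of a few more false positives.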

Contact

If you like this content, send a hug on Twitter for more stuff like this!