Opening links from a txt file in a headless browser in Python

Question:

I have been having a problem running the code below and suspect the problem is with link.strip(). The program runs in a Linux environment and is supposed to open multiple links contained in a text file so that Snort can scan the resulting traffic for malware. The file name is passed on the command line when the script is run.

import os
import subprocess
import time
import argparse

def read_links_from_file(file_path):
    links = []
    with open(file_path, 'r') as file:
        for line in file:
            links.append(line.strip())
    return links

def open_links_in_chrome(links, headless=True):
    options = '--headless' if headless else ''
    for link in links:
        subprocess.call('google-chrome {options} --app={link}', shell=True)
        time.sleep(1)

def run_snort(interface):
    subprocess.call(f'snort -i {interface}', shell=True)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', help='Path to file containing links', required=True)
    parser.add_argument('--interface', help='Network interface for Snort to listen on', required=True)
    parser.add_argument('--headless', help='Run Chrome in headless mode', action='store_true')
    args = parser.parse_args()
    file_path = args.file
    interface = args.interface
    headless = args.headless
    
    links = read_links_from_file(file_path)
    snort_process = subprocess.Popen(['snort', '-i', interface])
    open_links_in_chrome(links, headless)
    snort_process.terminate()

if __name__ == '__main__':
    main()

I tried reconfiguring the applications and rewrote the code, but I’m not sure I preserved the right code, and

links.append(line.strip())

doesn’t seem to be the right way to go. I have also changed the sleep time from 5 to 1.

After some tinkering I ended up with the following error

Acquiring network traffic from "eth0".
ERROR: Can't start DAQ (-1) - socket: Operation not permitted!
Fatal Error, Quitting..
libva error: vaGetDriverNameByIndex() failed with unknown libva error, driver_name = (null)
[121024:121024:0217/122814.243731:ERROR:gpu_memory_buffer_support_x11.cc(49)] dri3 extension not supported.
[121070:8:0217/122815.025776:ERROR:command_buffer_proxy_impl.cc(128)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
Fontconfig error: Cannot load default config file: No such file: (null)

Asked By: S Z


Answers:

I have been having a problem running the code below and suspect the problem is with link.strip().

I assume you mean line.strip() (you’re not calling link.strip() anywhere in your code). If you think the code is problematic, let’s test it. If I have a file urls.txt that contains a list of four URLs:

https://google.com
https://stackoverflow.com
https://www.npr.org/programs/wait-wait-dont-tell-me/
https://www.nyan.cat/

And then run the following code:

import sys

def read_links_from_file(file_path):
    links = []
    with open(file_path, 'r') as file:
        for line in file:
            links.append(line.strip())
    return links

links = read_links_from_file('urls.txt')
for i, link in enumerate(links):
    print(f'{i}: {link}')

I get the following output:

0: https://google.com
1: https://stackoverflow.com
2: https://www.npr.org/programs/wait-wait-dont-tell-me/
3: https://www.nyan.cat/

That suggests your read_links_from_file function works as expected.

On the other hand, you’re doing more work than is necessary. The default behavior of a Python file object is to act as an iterator over the lines in the file, so instead of writing this:

def read_links_from_file(file_path):
    links = []
    with open(file_path, 'r') as file:
        for line in file:
            links.append(line.strip())
    return links

links = read_links_from_file(args.file)
open_links_in_chrome(links, args.headless)

You can just delete the read_links_from_file function and pass the open file:

with open(args.file) as links:
    open_links_in_chrome((line.strip() for line in links), args.headless)

I’m cheating a bit here because instead of simply iterating over the file I’m using a generator expression to take care of stripping the end-of-line character.
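To see what that kind of lazy stripping does without touching the filesystem, here is a small sketch that uses io.StringIO as a stand-in for the open file. The helper name iter_links is hypothetical, and skipping blank lines is an extra assumption that the original code does not make.

```python
import io


def iter_links(lines):
    """Yield each non-empty line with surrounding whitespace removed.

    `lines` can be any iterable of strings, including an open file.
    Skipping blank lines is an assumption added for this sketch.
    """
    for line in lines:
        link = line.strip()
        if link:
            yield link


# io.StringIO can be iterated line by line, just like an open text file.
fake_file = io.StringIO("https://google.com\n\nhttps://stackoverflow.com\n")
links = list(iter_links(fake_file))
print(links)
```

Because iter_links is a generator, nothing is read until the consumer starts looping, which is the same behavior the generator expression gives you.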


You have an error in your open_links_in_chrome function. You have written:

subprocess.call('google-chrome {options} --app={link}', shell=True)

This will result in running the literal command line…

google-chrome {options} --app={link}

…because you are neither using a Python f-string nor are you calling the .format() method. You need to write the function like this in order to run Chrome as expected:

def open_links_in_chrome(links, headless=True):
    options = '--headless' if headless else ''
    for link in links:
        subprocess.call(f'google-chrome {options} --app={link}', shell=True)
        time.sleep(1)
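One caution about building the command as a single string: with shell=True the URL passes through the shell, so a character such as & inside a link may be interpreted by the shell instead of being handed to Chrome. A possible alternative, sketched here on the assumption that google-chrome is on your PATH, is to build an argument list and drop shell=True. The helper name build_chrome_command and the example URL are hypothetical; the --headless and --app flags are the ones from the question.

```python
def build_chrome_command(link, headless=True):
    """Return the Chrome command as an argument list (no shell involved)."""
    command = ['google-chrome']
    if headless:
        command.append('--headless')
    command.append(f'--app={link}')
    return command


# The list can be passed to subprocess.call() as-is, without shell=True.
command = build_chrome_command('https://example.com/?a=1&b=2')
print(command)
```

This only changes how the command is assembled; Chrome itself is launched exactly as in the function above.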

Fixing the command string introduces a new problem: Chrome will now open the first URL successfully, but it will never exit, so your code won’t continue past this point.

Rather than trying to fix this, I would suggest using a browser automation library like Playwright or Selenium. Here’s your code rewritten to use Playwright:

from playwright.sync_api import sync_playwright
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

import subprocess
import time
import argparse

def open_links_in_chrome(links, headless=True):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        page = browser.new_page()
        for link in links:
            print(f'fetching {link}')
            try:
                page.goto(link)
            except PlaywrightTimeoutError:
                # Move on to the next link if the page does not load in time.
                print(f'{link} timed out.')
            time.sleep(1)
        browser.close()

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--file', help='Path to file containing links', required=True)
    parser.add_argument('--interface', help='Network interface for Snort to listen on', required=True)
    parser.add_argument('--headless', help='Run Chrome in headless mode', action='store_true')
    args = parser.parse_args()

    snort_process = subprocess.Popen(['snort', '-i', args.interface])
    with open(args.file) as links:
        open_links_in_chrome((line.strip() for line in links), headless=args.headless)
    snort_process.terminate()

if __name__ == '__main__':
    main()

If we run this (assuming we have followed the Playwright installation instructions), we see the following output:

fetching https://google.com
fetching https://stackoverflow.com
fetching https://www.npr.org/programs/wait-wait-dont-tell-me/
fetching https://www.nyan.cat/

In my tests I’ve replaced snort with tcpdump, and examining the resulting packet capture I can see that we’re making the expected network requests:

$ tcpdump -r packets port 53 | grep -E 'A? (google.com|stackoverflow.com|www.npr.org|www.nyan.cat)'
reading from file packets, link-type EN10MB (Ethernet), snapshot length 262144
20:23:37.319272 IP madhatter.52135 > _gateway.domain: 52609+ A? stackoverflow.com. (35)
20:23:38.811385 IP madhatter.39144 > _gateway.domain: 15910+ AAAA? www.npr.org. (29)
20:23:38.811423 IP madhatter.52655 > _gateway.domain: 13756+ A? www.npr.org. (29)
20:23:41.214261 IP madhatter.46762 > _gateway.domain: 20587+ AAAA? www.nyan.cat. (30)
20:23:41.214286 IP madhatter.43846 > _gateway.domain: 12335+ A? www.nyan.cat. (30)
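The script calls terminate() on the capture process but never waits for it to exit. One way to tidy that up, offered as a minimal sketch rather than a definitive fix, is a context manager that starts the capture command, yields while the links are visited, then terminates the process and waits for it. The name capture_traffic and the five-second timeout are assumptions for illustration; in real use the command would be something like ['snort', '-i', interface]. Packet capture on Linux generally needs elevated privileges, which is the likely cause of the "Operation not permitted" error in the question.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def capture_traffic(command, timeout=5):
    """Run `command` in the background for the duration of the with-block."""
    process = subprocess.Popen(command)
    try:
        yield process
    finally:
        process.terminate()  # ask the process to stop
        try:
            process.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            process.kill()  # force it if it ignores the request
            process.wait()


# Stand-in for snort or tcpdump so the sketch runs without special privileges.
placeholder = [sys.executable, '-c', 'import time; time.sleep(60)']
with capture_traffic(placeholder) as proc:
    pass  # the links would be opened here

print(proc.returncode is not None)
```

Waiting after terminate() gives the capture tool a chance to flush its output before the script exits.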
Answered By: larsks