Parsing Apache log files

Question:

I just started learning Python and would like to read an Apache log file and put parts of each line into different lists.

A line from the file:

172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"

According to the Apache website, the format is:

%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"

I’m able to open the file and just read it as it is but I don’t know how to make it read in that format so I can put each part in a list.

Asked By: ogward

Answers:

This is a job for regular expressions.

For example:

line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
regex = r'([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) - "(.*?)" "(.*?)"'

import re
print(re.match(regex, line).groups())

The output would be a tuple with 6 pieces of information from the line (specifically, the groups within parentheses in that pattern):

('172.16.0.3', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827')
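
To get from a single line to the per-field lists the question asks for, the same idea can be applied line by line. The following is a minimal sketch rather than part of the original answer: the pattern, the `parse_lines` helper and the field names are assumptions for the combined log format, and lines that do not match are simply skipped.

```python
import re

# Assumed pattern for the combined log format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>.*?)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>.*?)" "(?P<agent>.*?)"'
)


def parse_lines(lines):
    """Collect each field of every matching line into its own list."""
    columns = {name: [] for name in LOG_PATTERN.groupindex}
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        for name, value in match.groupdict().items():
            columns[name].append(value)
    return columns


sample = [
    '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" '
    '"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"',
]
columns = parse_lines(sample)
print(columns['host'], columns['status'])
```

Because `parse_lines` accepts any iterable of strings, an open file object can be passed in directly instead of the sample list.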
Answered By: David Robinson

Use a regular expression to split a row into separate “tokens”:

>>> row = """172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827" """
>>> import re
>>> list(map(''.join, re.findall(r'"(.*?)"|\[(.*?)\]|(\S+)', row)))
['172.16.0.3', '-', '-', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '-', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827']
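
The tokens come back in the same order as the directives in the log format, so they can be labelled by position. This is only a sketch: the names in `FIELDS` and the `tokenize` helper are assumptions chosen to mirror `%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"`, not something the log itself provides.

```python
import re

# Assumed field names, one per directive in the combined log format.
FIELDS = ['host', 'ident', 'user', 'time', 'request',
          'status', 'size', 'referer', 'agent']
TOKEN = re.compile(r'"(.*?)"|\[(.*?)\]|(\S+)')


def tokenize(row):
    # At most one of the three groups is non-empty per match (all are
    # empty for ""), so joining them yields the token text.
    return [''.join(groups) for groups in TOKEN.findall(row)]


row = ('172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - '
       '"" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"')
record = dict(zip(FIELDS, tokenize(row)))
print(record['time'], record['status'])
```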

Another solution is to use a dedicated tool, e.g. http://pypi.python.org/pypi/pylogsparser/0.4

Answered By: georg

I have created a Python library which does just that: apache-log-parser.

>>> import apache_log_parser
>>> from pprint import pprint
>>> line_parser = apache_log_parser.make_parser("%h <<%P>> %t %Dus \"%r\" %>s %b  \"%{Referer}i\" \"%{User-Agent}i\" %l %u")
>>> log_line_data = line_parser('127.0.0.1 <<6113>> [16/Aug/2013:15:45:34 +0000] 1966093us "GET / HTTP/1.1" 200 3478  "https://example.com/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)" - -')
>>> pprint(log_line_data)
{'pid': '6113',
 'remote_host': '127.0.0.1',
 'remote_logname': '-',
 'remote_user': '',
 'request_first_line': 'GET / HTTP/1.1',
 'request_header_referer': 'https://example.com/',
 'request_header_user_agent': 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)',
 'response_bytes_clf': '3478',
 'status': '200',
 'time_received': '[16/Aug/2013:15:45:34 +0000]',
 'time_us': '1966093'}
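
The `time_received` value above is still a string. Assuming only that string is available, a small sketch of turning it into a timezone-aware `datetime` with the standard library could look like this. The helper name `parse_apache_time` is made up for illustration, and `%b` relies on English month names, so it depends on the active locale.

```python
from datetime import datetime


def parse_apache_time(value):
    """Parse an Apache %t timestamp such as '[16/Aug/2013:15:45:34 +0000]'."""
    # Strip the surrounding brackets, then parse day/month/year:time offset.
    return datetime.strptime(value.strip('[]'), '%d/%b/%Y:%H:%M:%S %z')


received = parse_apache_time('[16/Aug/2013:15:45:34 +0000]')
print(received.isoformat())
```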
Answered By: Amandasaurus

RegEx seemed extreme and problematic considering the simplicity of the format, so I wrote this little splitter which others may find useful as well:

def apache2_logrow(s):
    ''' Fast split on Apache2 log lines

    http://httpd.apache.org/docs/trunk/logs.html
    '''
    row = [ ]
    qe = qp = None # quote end character (qe) and quote parts (qp)
    for s in s.replace('\r','').replace('\n','').split(' '):
        if qp:
            qp.append(s)
        elif '' == s: # blanks
            row.append('')
        elif '"' == s[0]: # begin " quote "
            qp = [ s ]
            qe = '"'
        elif '[' == s[0]: # begin [ quote ]
            qp = [ s ]
            qe = ']'
        else:
            row.append(s)

        l = len(s)
        if l and qe == s[-1]: # end quote
            if l == 1 or s[-2] != '\\': # don't end on escaped quotes
                row.append(' '.join(qp)[1:-1].replace('\\'+qe, qe))
                qp = qe = None
    return row
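
For comparison, a similar regex-free split can be sketched with the standard library's `shlex` module. This is an alternative technique, not the function above: `split_log_line` is a hypothetical name, the lexer settings are assumptions chosen for Apache-style lines, consecutive spaces are collapsed instead of producing empty fields, a quoted value that itself starts with `[` would be misread, and an unbalanced quote raises `ValueError`.

```python
import shlex


def split_log_line(line):
    """Split a log line on whitespace, keeping "quoted" and [bracketed] parts whole."""
    lexer = shlex.shlex(line, posix=True)
    lexer.whitespace_split = True  # split only on whitespace
    lexer.quotes = '"'             # treat only double quotes as quoting
    lexer.commenters = ''          # do not treat '#' as a comment start

    row, bracket = [], None
    for token in lexer:
        if bracket is not None:
            bracket.append(token)      # inside [ ... ]
        elif token.startswith('['):
            bracket = [token]          # begin [ ... ]
        else:
            row.append(token)
        if bracket is not None and token.endswith(']'):
            # Rejoin the bracketed pieces and drop the brackets themselves.
            row.append(' '.join(bracket)[1:-1])
            bracket = None
    return row


line = ('172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - '
        '"" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"')
print(split_log_line(line))
```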
Answered By: Neil C. Obremski

Add this to httpd.conf to write the Apache logs as JSON.

LogFormat "{\"time\":\"%t\", \"remoteIP\" :\"%a\", \"host\": \"%V\", \"request_id\": \"%L\", \"request\":\"%U\", \"query\" : \"%q\", \"method\":\"%m\", \"status\":\"%>s\", \"userAgent\":\"%{User-agent}i\", \"referer\":\"%{Referer}i\" }" json_log

CustomLog /var/log/apache_access_log json_log
CustomLog "|/usr/bin/python -u apacheLogHandler.py" json_log

Now you will see your access logs in JSON format.
Use the Python code below to read the JSON logs as they are constantly updated.

apacheLogHandler.py

import time

# Adjust the path to the file named in the CustomLog directive.
f = open('apache_access_log.log', 'r')
for line in f:  # read all lines already in the file
    print(line.strip())

# Keep waiting forever for more lines.
while True:
    line = f.readline()  # returns '' when nothing new has been written
    if line:  # if you got something...
        print('got data:', line.strip())
    else:
        time.sleep(1)  # wait before polling again
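
Each line written with the LogFormat above is meant to be a JSON object, so the stripped line can be handed to `json.loads`. The sketch below assumes that shape. The `parse_json_line` helper and the sample line are made up for illustration, and invalid lines are returned as `None` because Apache may escape some characters in logged values in ways that are not valid JSON.

```python
import json


def parse_json_line(line):
    """Return the log entry as a dict, or None if the line is not valid JSON."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None


# Hypothetical line in the shape produced by the json_log format.
sample = ('{"time":"[24/Mar/2017:19:37:57 +0000]", "remoteIP" :"180.76.15.30", '
          '"request":"/shop/page/32/", "method":"GET", "status":"404" }')
entry = parse_json_line(sample)
print(entry['remoteIP'], entry['status'])
```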
Answered By: Preethi Lakku

import re


HOST = r'^(?P<host>.*?)'
SPACE = r'\s'
IDENTITY = r'\S+'
USER = r'\S+'
TIME = r'(?P<time>\[.*?\])'
REQUEST = r'"(?P<request>.*?)"'
STATUS = r'(?P<status>\d{3})'
SIZE = r'(?P<size>\S+)'

REGEX = HOST+SPACE+IDENTITY+SPACE+USER+SPACE+TIME+SPACE+REQUEST+SPACE+STATUS+SPACE+SIZE+SPACE

def parser(log_line):
    match = re.search(REGEX, log_line)
    return (match.group('host'),
            match.group('time'),
            match.group('request'),
            match.group('status'),
            match.group('size'))


logLine = '180.76.15.30 - - [24/Mar/2017:19:37:57 +0000] "GET /shop/page/32/?count=15&orderby=title&add_to_wishlist=4846 HTTP/1.1" 404 10202 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"'
result = parser(logLine)
print(result)
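
All five fields come back as strings. A possible follow-up step, sketched here with a hypothetical `to_record` helper and key names of my own choosing, converts the status and size to integers and splits the request line into its parts. The `fields` tuple is the result expected from the sample line above.

```python
def to_record(host, time, request, status, size):
    """Turn the string fields into a dict with typed values."""
    parts = request.split()
    # A well-formed request line is "METHOD PATH PROTOCOL"; anything else
    # is kept only as the raw string.
    method, path, protocol = parts if len(parts) == 3 else (None, None, None)
    return {
        'host': host,
        'time': time.strip('[]'),
        'request': request,
        'method': method,
        'path': path,
        'protocol': protocol,
        'status': int(status),
        'size': 0 if size == '-' else int(size),  # %b logs '-' when no bytes were sent
    }


fields = ('180.76.15.30', '[24/Mar/2017:19:37:57 +0000]',
          'GET /shop/page/32/?count=15&orderby=title&add_to_wishlist=4846 HTTP/1.1',
          '404', '10202')
record = to_record(*fields)
print(record['method'], record['status'], record['size'])
```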
Answered By: Fuji Komalan