grep: write error: Broken pipe with subprocess
Question:
I get a couple of "grep: write error" messages when I run this code.
What am I missing?
This is only part of it:
while d <= datetime.datetime(year, month, daysInMonth[month]):
    day = d.strftime("%Y%m%d")
    print day
    results = [day]
    first=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' "+ monthDir +"/"+day+"*.txt | grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt", shell=True, stdout=subprocess.PIPE, )
    output1=first.communicate()[0]
    d += delta
    day = d.strftime("%Y%m%d")
    second=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' "+ monthDir +"/"+day+"*.txt | grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt", shell=True, stdout=subprocess.PIPE, )
    output2=second.communicate()[0]
    articleList = (output1.split('\n'))
    articleList2 = (output2.split('\n'))
    results.append( len(articleList)+len(articleList2))
    w.writerow(tuple(results))
    d += delta
Answers:
To find the files that match both patterns, nest one grep inside the other:
    grep -l pattern1 $(grep -l pattern2 files)
$(command) substitutes the output of the command into the command line, so the outer grep searches only the files that the inner grep listed.
So your script should be:
    first=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' $(grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt)", shell=True, stdout=subprocess.PIPE, )
and similarly for second.
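The nested command can also be run from Python without building a shell string. The following is a minimal sketch in Python 3 (not the Python 2 of the question); the helper names grep_l and files_matching_both are hypothetical, and it assumes a grep that accepts -E, -l, -i, -w and -e is on the PATH. It guards the case where the inner search finds nothing, because grep with no file arguments would read standard input instead.

```python
import glob
import subprocess


def grep_l(pattern, files):
    """Return the subset of files that grep reports as matching pattern."""
    if not files:
        # With no file arguments grep would wait on stdin, so stop early.
        return []
    proc = subprocess.run(
        ["grep", "-Eliw", "-e", pattern, "--"] + list(files),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # grep exits with status 1 when nothing matches; that is not an error here.
    return proc.stdout.splitlines()


def files_matching_both(pattern1, pattern2, file_glob):
    """Equivalent in spirit to: grep -l pattern1 $(grep -l pattern2 files)."""
    candidates = grep_l(pattern2, sorted(glob.glob(file_glob)))
    return grep_l(pattern1, candidates)
```

For the question's data this would be called as files_matching_both('Algeria|Bahrain', 'Protest|protesters', monthDir + '/' + day + '*.txt'), and len() of the result gives the article count directly.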
If you are just looking for whole words, you could use the count() method of a list:
    # assuming names is a list of filenames
    for fn in names:
        with open(fn) as infile:
            text = infile.read().lower()
        # remove punctuation
        text = text.replace(',', '')
        text = text.replace('.', '')
        words = text.split()
        print "Algeria:", words.count('algeria')
        print "Bahrain:", words.count('bahrain')
        print "protesters:", words.count('protesters')
        print "protest:", words.count('protest')
If you want more powerful filtering, use re.
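As one way to follow the re suggestion, the sketch below counts whole-word, case-insensitive occurrences with a single compiled pattern, so punctuation does not need to be stripped by hand. It is written for Python 3, and the function name count_words is an assumption made for illustration.

```python
import re
from collections import Counter


def count_words(text, words):
    """Count whole-word, case-insensitive occurrences of each word in text."""
    # \b anchors each alternative at word boundaries, so "protest" does not
    # match inside "protesters" or "protesting".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    # Keys are lower-cased, so look results up with lower-case words.
    return Counter(m.group(1).lower() for m in pattern.finditer(text))
```

A Counter returns 0 for words that never occur, so counts['bahrain'] is safe even when the file does not mention Bahrain.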
When you do A | B in a shell, process A’s output is piped into process B as input. If process B shuts down before reading all of process A’s output (e.g. because it found what it was looking for, which is the function of the -l option), then process A may complain that its output pipe was prematurely closed. In your command the second grep is given file names, so it never reads its standard input at all.
These errors are basically harmless, and you can work around them by redirecting stderr in the subprocesses to /dev/null.
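A minimal sketch of the stderr redirection described above, assuming Python 3.3 or later, where subprocess.DEVNULL is available. The helper run_quiet and the NOISY_CHILD stand-in command are hypothetical; the stand-in only imitates a process that prints a message on stderr and a file name on stdout.

```python
import subprocess
import sys

# Stand-in for the grep pipeline: writes a fake error to stderr and a
# file name to stdout. Purely illustrative.
NOISY_CHILD = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('grep: write error\\n'); print('20110101a.txt')",
]


def run_quiet(args, shell=False):
    """Run a command, discard its stderr, and return its stdout as text."""
    proc = subprocess.Popen(
        args,
        shell=shell,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    out, _ = proc.communicate()
    return out.decode()
```

With the question's command string, the call would look like run_quiet(cmd, shell=True). Appending 2>/dev/null inside the shell command has the same effect.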
A better approach, though, may simply be to use Python’s powerful regex capabilities to read the files:
    import glob
    import re

    def fileContains(fn, pat):
        # Return True as soon as any line of the file matches the pattern.
        with open(fn) as f:
            for line in f:
                if re.search(pat, line):
                    return True
        return False

    first = []
    for file in glob.glob(monthDir +"/"+day+"*.txt"):
        if fileContains(file, 'Algeria|Bahrain') and fileContains(file, 'Protest|protesters'):
            first.append(file)
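A possible variation on this pure-Python approach, sketched for Python 3: it reads each file once and checks every pattern per line, and it adds re.IGNORECASE and \b word boundaries to roughly mirror grep's -i and -w options. The function name file_matches_all is hypothetical, and \b is only an approximation of how grep -w delimits words.

```python
import re


def file_matches_all(path, patterns):
    """Return True if every pattern matches somewhere in the file."""
    compiled = [
        re.compile(r"\b(?:" + p + r")\b", re.IGNORECASE) for p in patterns
    ]
    # Indexes of the patterns that have not matched any line yet.
    pending = set(range(len(compiled)))
    with open(path) as f:
        for line in f:
            pending = {i for i in pending if not compiled[i].search(line)}
            if not pending:
                # Every pattern has been seen; no need to read further.
                return True
    return not pending
```

Used with the question's data, the list of articles for one day would be [p for p in glob.glob(monthDir + '/' + day + '*.txt') if file_matches_all(p, ['Algeria|Bahrain', 'Protest|protesters'])].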
Pass a stderr argument to Popen. Which values are available depends on the Python version; the one below also works on Python versions below 3. Note that stderr=subprocess.STDOUT merges the error messages into the captured output rather than discarding them.
    first=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' "+ monthDir +"/"+day+"*.txt | grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt", shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)