Strip \n \t \r in Scrapy

Question:

I’m trying to strip \r \n \t characters with a Scrapy spider and then write the results to a JSON file.

I have a “description” field that is full of newlines, and it doesn’t do what I want: matching each description to a title.

I tried with map(unicode.strip), but it doesn’t really work. Being new to Scrapy, I don’t know if there’s a simpler way or how map and unicode.strip really work.

This is my code:

def parse(self, response):
    for sel in response.xpath('//div[@class="d-grid-main"]'):
        item = xItem()
        item['TITLE'] = sel.xpath('xpath').extract()
        item['DESCRIPTION'] = map(unicode.strip, sel.xpath('//p[@class="class-name"]/text()').extract())

I tried also with:

item['DESCRIPTION'] = str(sel.xpath('//p[@class="class-name"]/text()').extract()).strip()

But it raised an error. What’s the best way?

Asked By: Lara M.


Answers:

unicode.strip only deals with whitespace characters at the beginning and end of strings

Return a copy of the string with the leading and trailing characters removed.

not with \n, \r, or \t in the middle.

You can either use a custom method to remove those characters inside the string (using the regular expression module; a sketch follows the shell session below), or use XPath’s normalize-space(), which

returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.

Example Python shell session:

>>> text='''<html>
... <body>
... <div class="d-grid-main">
... <p class="class-name">
... 
...  This is some text,
...  with some newlines \r
...  and some \t tabs \t too;
... 
... <a href="http://example.com"> and a link too
...  </a>
... 
... I think we're done here
... 
... </p>
... </div>
... </body>
... </html>'''
>>> response = scrapy.Selector(text=text)
>>> response.xpath('//div[@class="d-grid-main"]')
[<Selector xpath='//div[@class="d-grid-main"]' data=u'<div class="d-grid-main">\n<p class="clas'>]
>>> div = response.xpath('//div[@class="d-grid-main"]')[0]
>>> 
>>> # you'll want to use relative XPath expressions, starting with "./"
>>> div.xpath('.//p[@class="class-name"]/text()').extract()
[u'\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n',
 u"\n\nI think we're done here\n\n"]
>>> 
>>> # only leading and trailing whitespace is removed by strip()
>>> map(unicode.strip, div.xpath('.//p[@class="class-name"]/text()').extract())
[u'This is some text,\n with some newlines \r\n and some \t tabs \t too;', u"I think we're done here"]
>>> 
>>> # normalize-space() will get you a single string on the whole element
>>> div.xpath('normalize-space(.//p[@class="class-name"])').extract()
[u"This is some text, with some newlines and some tabs too; and a link too I think we're done here"]
>>> 
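If you prefer the regular-expression route mentioned at the start, a minimal sketch of such a custom method could look like this. Collapsing every run of whitespace into a single space is an assumption about the desired output; the field and selector names are taken from the question:

import re

def clean_whitespace(text):
    # collapse runs of \r, \n, \t (and any other whitespace) into a single space,
    # then remove leading/trailing whitespace -- roughly what normalize-space() does
    return re.sub(r'\s+', ' ', text).strip()

# hypothetical usage inside parse():
# item['DESCRIPTION'] = [clean_whitespace(t) for t in
#                        sel.xpath('.//p[@class="class-name"]/text()').extract()]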
Answered By: paul trmbrth

As paul trmbrth suggests in his answer,

div.xpath('normalize-space(.//p[@class="class-name"])').extract()

is likely to be what you want. However, normalize-space() also condenses whitespace inside the string into a single space. If you want to remove only \r, \n, and \t without disturbing the other whitespace, you can use translate() to remove characters:

trans_table = {ord(c): None for c in u'\r\n\t'}
item['DESCRIPTION'] = ' '.join(s.translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())

This will still leave leading and trailing whitespace that is not in the set \r, \n, or \t. If you also want to be rid of that, just insert a call to strip():

item['DESCRIPTION'] = ' '.join(s.strip().translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())
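The snippets above are Python 2 (unicode strings). On Python 3 the same idea works with str.maketrans and str.translate instead of a hand-built dict of code points; a minimal sketch, assuming the same field and selector as above:

# Python 3 version of the same approach
trans_table = str.maketrans('', '', '\r\n\t')
item['DESCRIPTION'] = ' '.join(
    s.strip().translate(trans_table)
    for s in sel.xpath('//p[@class="class-name"]/text()').extract()
)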
Answered By: mhawke

I’m a Python and Scrapy newbie and I had a similar issue today. I solved it with the help of the module/function w3lib.html.replace_escape_chars. I created a default input processor for my item loader and it works without any issues. You can also bind this to a specific scrapy.Field(), and it works with CSS selectors and CSV feed exports:

from w3lib.html import replace_escape_chars
from scrapy.loader.processors import MapCompose  # itemloaders.processors on newer Scrapy
yourloader.default_input_processor = MapCompose(replace_escape_chars)
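For context, a minimal sketch of how this might be wired into an item loader; the loader class and field names here are made up, and the item xItem is the one from the question:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from w3lib.html import replace_escape_chars

class CleanLoader(ItemLoader):
    # applied to every field unless a field-specific input processor is set
    default_input_processor = MapCompose(replace_escape_chars, str.strip)

# hypothetical usage inside parse():
# loader = CleanLoader(item=xItem(), selector=sel)
# loader.add_xpath('DESCRIPTION', './/p[@class="class-name"]/text()')
# item = loader.load_item()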
Answered By: Peter Húbek

The simplest example, extracting the price from alibris.com, is:

response.xpath('normalize-space(//td[@class="price"]//p)').get()
Answered By: user1994

When I used Scrapy to crawl a web page, I encountered the same problem. I have two ways to solve it. The first uses regular expression substitution. Since response.xpath returns a list but the substitution only operates on strings, I fetch each item of the list as a string in a for loop, remove the '\n' and '\t' in each item, and then append the result to a new list.

import re

test_string = ["\n\t\t", "\n\t\t\n\t\t\n\t\t\t\t\t", "\n", "\n", "\n", "\n", "Do you like shopping?", "\n", "Yes, I\u2019m a shopaholic.", "\n", "What do you usually shop for?", "\n", "I usually shop for clothes. I\u2019m a big fashion fan.", "\n", "Where do you go shopping?", "\n", "At some fashion boutiques in my neighborhood.", "\n", "Are there many shops in your neighborhood?", "\n", "Yes. My area is the city center, so I have many choices of where to shop.", "\n", "Do you spend much money on shopping?", "\n", "Yes and I\u2019m usually broke at the end of the month.", "\n", "\n\n\n", "\n", "\t\t\t\t", "\n\t\t\t\n\t\t\t", "\n\n\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t"]
print(test_string)

# remove \t and \n
a = re.compile(r'(\t)+')
b = re.compile(r'(\n)+')
text = []
for n in test_string:
    n = a.sub('', n)
    n = b.sub('', n)
    text.append(n)
print(text)

# remove all ''
while '' in text:
    text.remove('')
print(text)

The second method uses map() and strip(). The map() function processes the list directly and keeps the original structure. unicode is used in Python 2 and was changed to str in Python 3, as follows:

text = list(map(str.strip, test_string))
print(text)

The strip function only removes \n, \t, and \r from the beginning and end of the string, not from the middle. That is different from removing the characters outright.
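A quick shell illustration of that difference (the string here is just a made-up example):

>>> '  \n\t hello \n world \t\n '.strip()
'hello \n world'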

Answered By: Ryan

If you want to preserve the list instead of joining everything into a single string, there is no need for extra steps; you can simply call getall() instead of get():

response.xpath('normalize-space(.//td[@class="price"]/text())').getall()

Also, you should add text() at the end.

Hope it helps anybody!

Answered By: Cesar Flores

You can try using CSS combined with get().strip(); it works for me.
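A minimal sketch of what that can look like (the selector is only an example, not from the original question; default='' avoids an error when nothing matches):

# strip leading/trailing whitespace from the first matching text node
title = response.css('p.class-name::text').get(default='').strip()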

Answered By: Hao
str(i.css("p::text")[1].extract()).strip()
Answered By: Suny Rajput