Scrapy csv output "randomly" missing fields

Question:

My Scrapy crawler correctly reads all fields, as the debug output shows:

2017-01-29 02:45:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.willhaben.at/iad/immobilien/mietwohnungen/niederoesterreich/krems-an-der-donau/altbauwohnung-wg-geeignet-donaublick-189058451/>
{'Heizung': 'Gasheizung', 'whCode': '189058451', 'Teilmöbliert / Möbliert': True, 'Wohnfläche': '105', 'Objekttyp': 'Zimmer/WG', 'Preis': 1050.0, 'Miete (inkl. MWSt)': 890.0, 'Stockwerk(e)': '2', 'Böden': 'Laminat', 'Bautyp': 'Altbau', 'Zustand': 'Sehr gut/gut', 'Einbauküche': True, 'Zimmer': 3.0, 'Miete (exkl. MWSt)': 810.0, 'Befristung': 'nein', 'Verfügbar': 'ab sofort', 'zipcode': 3500, 'Gesamtbelastung': 1150.0}

but when I output the CSV using the command-line option

scrapy crawl mietwohnungen -o mietwohnungen.csv --logfile=mietwohnungen.log

some of the fields are missing, as the corresponding line from the output file shows:

Keller,whCode,Garten,Zimmer,Terrasse,Wohnfläche,Parkplatz,Objekttyp,Befristung,zipcode,Preis
,189058451,,3.0,,105,,Zimmer/WG,nein,3500,1050.0

The fields missing in the example are: Heizung, Teilmöbliert / Möbliert, Miete (inkl. MWSt), Stockwerk(e), Böden, Bautyp, Zustand, Einbauküche, Miete (exkl. MWSt), Verfügbar, Gesamtbelastung

This happens for a number of the values I scrape. One thing to note is that not every page contains the same fields, so I generate the field names depending on the page: I build a dict containing whatever fields are present and yield it at the end, roughly as in the sketch below. This works, as the DEBUG output shows, yet some of the CSV columns are never written.
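
A minimal sketch of that pattern (the selectors, class names and start URL below are placeholders, not my actual spider code):

import scrapy

class MietwohnungenSpider(scrapy.Spider):
    name = 'mietwohnungen'
    # placeholder; the real spider starts from my search result pages
    start_urls = ['https://www.willhaben.at/iad/immobilien/mietwohnungen/']

    def parse(self, response):
        # the listing id is the last dash-separated part of the URL
        item = {'whCode': response.url.rstrip('/').rsplit('-', 1)[-1]}
        # each page exposes a different set of attribute rows,
        # so the dict keys vary from page to page
        for row in response.css('.attribute-row'):  # hypothetical selector
            key = row.css('.label::text').extract_first()
            value = row.css('.value::text').extract_first()
            if key and value:
                item[key.strip()] = value.strip()
        yield item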

As you can see, some columns are blank because other pages do have those fields (‘Keller’ in the example).

The scraper works if I scrape a smaller list (e.g. if I refine my initial search selection while keeping some of the problematic pages in the results):

Heizung,Zimmer,Bautyp,Gesamtbelastung,Einbauküche,Miete (exkl. MWSt),Zustand,Miete (inkl. MWSt),zipcode,Teilmöbliert / Möbliert,Objekttyp,Stockwerk(e),Böden,Befristung,Wohnfläche,whCode,Preis,Verfügbar
Gasheizung,3.0,Altbau,1150.0,True,810.0,Sehr gut/gut,890.0,3500,True,Zimmer/WG,2,Laminat,nein,105,189058451,1050.0,ab sofort

I have already switched to Python 3 to rule out any Unicode string problems.

Is this a bug? It also seems to affect only the CSV output; if I export to XML, all fields are printed.

I don’t understand why it does not work with the full list. Is the only solution really to write a CSV exporter manually?

Asked By: MoRe


Answers:

If you yield scraped results as dicts, the CSV columns are populated from the keys of the first yielded dict, as this snippet from Scrapy's CsvItemExporter shows:

def _write_headers_and_set_fields_to_export(self, item):
    if self.include_headers_line:
        if not self.fields_to_export:
            if isinstance(item, dict):
                # for dicts try using fields of the first item
                self.fields_to_export = list(item.keys())
            else:
                # use fields declared in Item
                self.fields_to_export = list(item.fields.keys())
        row = list(self._build_row(self.fields_to_export))
        self.csv_writer.writerow(row)

So you should either define an Item with all of the fields declared explicitly and populate that, or write a custom CsvItemExporter.
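
There is also a middle ground if your Scrapy version is 1.1 or newer: the FEED_EXPORT_FIELDS setting populates fields_to_export directly, so the header no longer depends on which keys the first dict happens to contain. A sketch built from the fields visible in the question (abridged; extend it to cover every field you scrape):

# settings.py
FEED_EXPORT_FIELDS = [
    'whCode', 'zipcode', 'Objekttyp', 'Preis', 'Zimmer', 'Wohnfläche',
    'Heizung', 'Böden', 'Bautyp', 'Zustand', 'Stockwerk(e)', 'Befristung',
    'Verfügbar', 'Einbauküche', 'Teilmöbliert / Möbliert',
    'Miete (exkl. MWSt)', 'Miete (inkl. MWSt)', 'Gesamtbelastung',
    'Keller', 'Garten', 'Terrasse', 'Parkplatz',
]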

Answered By: mizhgun

Solution based on mizhgun’s answer:

I created an item pipeline that writes the CSV output. While processing the items it collects the set of unique keys, and it only writes the file once the spider closes, so the header can contain every column that appeared. Remember to remove the -o option when calling scrapy crawl and to add the pipeline to settings.py:

pipelines.py

import csv
import logging

class CsvWriterPipeline(object):

    def open_spider(self, spider):
        # newline='' is required by the csv module on Python 3
        # (on Python 2, open the file with 'wb' instead);
        # utf-8 so non-ASCII field names like 'Böden' survive on any platform
        self.file = open('mietwohnungen.csv', 'w', newline='', encoding='utf-8')
        self.items = []
        self.colnames = []

    def close_spider(self, spider):
        # all items have been seen by now, so the header is complete
        csv_writer = csv.DictWriter(self.file, fieldnames=self.colnames)
        logging.info('HEADER: %s', self.colnames)
        csv_writer.writeheader()
        for item in self.items:
            csv_writer.writerow(item)
        self.file.close()

    def process_item(self, item, spider):
        # record any field names not seen so far
        for f in item.keys():
            if f not in self.colnames:
                self.colnames.append(f)
        # buffer the item; it is written out in close_spider
        self.items.append(item)
        return item

settings.py

ITEM_PIPELINES = {
    'willhaben.pipelines.CsvWriterPipeline': 300,
}
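
With the pipeline enabled, run the crawl without -o, since the pipeline itself now writes mietwohnungen.csv:

scrapy crawl mietwohnungen --logfile=mietwohnungen.log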

This answer was posted as an edit to the question Scrapy csv output "randomly" missing fields by the OP MoRe under CC BY-SA 3.0.

Answered By: vvvvv