How to open a huge excel file efficiently

Question:

I have a 150MB one-sheet excel file that takes about 7 minutes to open on a very powerful machine using the following:

# using python
import xlrd
wb = xlrd.open_workbook(file)
sh = wb.sheet_by_index(0)

Is there any way to open the excel file quicker? I’m open to even very outlandish suggestions (such as hadoop, spark, c, java, etc.). Ideally I’m looking for a way to open the file in under 30 seconds if that’s not a pipe dream. Also, the above example is using python, but it doesn’t have to be python.


Note: this is an Excel file from a client. It cannot be converted into any other format before we receive it. It is not our file


UPDATE: Answer with a working example of code that will open the following 200MB excel file in under 30 seconds will be rewarded with bounty: https://drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view?usp=sharing. This file should have string (col 1), date (col 9), and number (col 11).

Asked By: David542

||

Answers:

Have you tried loading the worksheet on demand, which available since version 0.7.1 of xlrd?

To do this you need to pass on_demand=True to open_workbook().

xlrd.open_workbook(filename=None, logfile=<_io.TextIOWrapper
name=” mode=’w’ encoding=’UTF-8’>, verbosity=0, use_mmap=1,
file_contents=None, encoding_override=None, formatting_info=False,
on_demand=False, ragged_rows=False)


Other potential python solutions I found for reading an xlsx file:

  • Read the raw xml in ‘xl/sharedStrings.xml’ and ‘xl/worksheets/sheet1.xml’
  • Try the openpyxl library’s Read Only mode which claims too be optimized in memory usage for large files.

    from openpyxl import load_workbook wb = load_workbook(filename='large_file.xlsx', read_only=True) ws = wb['big_data']
    
    for row in ws.rows:
        for cell in row:
            print(cell.value)
    
  • If you are running on Windows you could use PyWin32 and ‘Excel.Application’

    import time
    import win32com.client as win32
    def excel():
       xl = win32.gencache.EnsureDispatch('Excel.Application')
       ss = xl.Workbooks.Add()
    ...
    
Answered By: mikek3332002

Most programming languages that work with Office products have some middle layer and this is usually where the bottleneck is, a good example is using PIA’s/Interop or Open XML SDK.

One way to get the data at a lower level (bypassing the middle layer) is using a Driver.

150MB one-sheet excel file that takes about 7 minutes.

The best I could do is a 130MB file in 135 seconds, roughly 3 times faster:

Stopwatch sw = new Stopwatch();
sw.Start();

DataSet excelDataSet = new DataSet();

string filePath = @"c:tempBigBook.xlsx";

// For .XLSXs we use =Microsoft.ACE.OLEDB.12.0;, for .XLS we'd use Microsoft.Jet.OLEDB.4.0; with  "';Extended Properties="Excel 8.0;HDR=YES;"";
string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties="Excel 12.0;HDR=YES;"";

using (OleDbConnection conn = new OleDbConnection(connectionString))
{
    conn.Open();
    OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter
    ("select * from [Sheet1$]", conn);
    objDA.Fill(excelDataSet);
    //dataGridView1.DataSource = excelDataSet.Tables[0];
}
sw.Stop();
Debug.Print("Load XLSX tool: " + sw.ElapsedMilliseconds + " millisecs. Records = "  + excelDataSet.Tables[0].Rows.Count);

enter image description here

Win 7×64, Intel i5, 2.3ghz, 8GB ram, SSD250GB.

If I could recommend a hardware solution as well, try to resolve it with an SSD if you’re using standard HDD’s.

Note: I cant download your Excel spreadsheet example as I’m behind a corporate firewall.

PS. See MSDN – Fastest Way to import xlsx files with 200 MB of Data, the consensus being OleDB is the fastest.

PS 2. Here’s how you can do it with python:
http://code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/

Answered By: Jeremy Thompson

I managed to read the file in about 30 seconds using .NET core and the Open XML SDK.

The following example returns a list of objects containing all rows and cells with the matching types, it supports date, numeric and text cells. The project is available here: https://github.com/xferaa/BigSpreadSheetExample/ (Should work on Windows, Linux and Mac OS and does not require Excel or any Excel component to be installed).

public List<List<object>> ParseSpreadSheet()
{
    List<List<object>> rows = new List<List<object>>();

    using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
    {
        WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
        WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();

        OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);

        Dictionary<int, string> sharedStringCache = new Dictionary<int, string>();

        int i = 0;
        foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements)
        {
            sharedStringCache.Add(i++, el.InnerText);
        }

        while (reader.Read())
        {
            if(reader.ElementType == typeof(Row))
            {
                reader.ReadFirstChild();

                List<object> cells = new List<object>();

                do
                {
                    if (reader.ElementType == typeof(Cell))
                    {
                        Cell c = (Cell)reader.LoadCurrentElement();

                        if (c == null || c.DataType == null || !c.DataType.HasValue)
                            continue;

                        object value;

                        switch(c.DataType.Value)
                        {
                            case CellValues.Boolean:
                                value = bool.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Date:
                                value = DateTime.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Number:
                                value = double.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.InlineString:
                            case CellValues.String:
                                value = c.CellValue.InnerText;
                                break;
                            case CellValues.SharedString:
                                value = sharedStringCache[int.Parse(c.CellValue.InnerText)];
                                break;
                            default:
                                continue;
                        }

                        if (value != null)
                            cells.Add(value);
                    }

                } while (reader.ReadNextSibling());

                if (cells.Any())
                    rows.Add(cells);
            }
        }
    }

    return rows;
}

I ran the program in a three year old Laptop with a SSD drive, 8GB of RAM and an Intel Core i7-4710 CPU @ 2.50GHz (two cores) on Windows 10 64 bits.

Note that although opening and parsing the whole file as strings takes a bit less than 30 seconds, when using objects as in the example of my last edit, the time goes up to almost 50 seconds with my crappy laptop. You will probably get closer to 30 seconds in your server with Linux.

The trick was to use the SAX approach as explained here:

https://msdn.microsoft.com/en-us/library/office/gg575571.aspx

Answered By: Isma

Python’s Pandas library could be used to hold and process your data, but using it to directly load the .xlsx file will be quite slow, e.g. using read_excel().

One approach would be to use Python to automate the conversion of your file into CSV using Excel itself and to then use Pandas to load the resulting CSV file using read_csv(). This will give you a good speed up, but not under 30 seconds:

import win32com.client as win32        
import pandas as pd    
from datetime import datetime    

print ("Starting")
start = datetime.now()

# Use Excel to load the xlsx file and save it in csv format
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(r'c:full pathBigSpreadsheet.xlsx')
excel.DisplayAlerts = False
wb.DoNotPromptForConvert = True
wb.CheckCompatibility = False

print('Saving')
wb.SaveAs(r'c:full pathtemp.csv', FileFormat=6, ConflictResolution=2) 
excel.Application.Quit()

# Use Pandas to load the resulting CSV file
print('Loading CSV')
df = pd.read_csv(r'c:full pathtemp.csv', dtype=str)

print(df.shape)
print("Done", datetime.now() - start)

Column types
The types for your columns can be specified by passing dtype and converters and parse_dates:

df = pd.read_csv(r'c:full pathtemp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True)

You should also specify infer_datetime_format=True, as this will greatly speed up the date conversion.

nfer_datetime_format : boolean, default False

If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be
inferred, switch to a faster method of parsing them. In some cases
this can increase the parsing speed by 5-10x.

Also add dayfirst=True if dates are in the form DD/MM/YYYY.

Selective columns
If you actually only need to work on columns 1 9 11, then you could further reduce resources by specifying usecols=[0, 8, 10] as follows:

df = pd.read_csv(r'c:full pathtemp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10])

The resulting dataframe would then only contain those 3 columns of data.

RAM drive
Using a RAM drive to store the temporary CSV file to would further speed up the load time.

Note: This does assume you are using a Windows PC with Excel available.

Answered By: Martin Evans

Well, if your excel is going to be as simple as a CSV file like your example (https://drive.google.com/file/d/0B_CXvCTOo7_2UVZxbnpRaEVnaFk/view?usp=sharing), you can try to open the file as a zip file and read directly every xml:

Intel i5 4460, 12 GB RAM, SSD Samsung EVO PRO.

If you have a lot of memory ram:
This code needs a lot of ram, but it takes 20~25 seconds. (You need the parameter -Xmx7g)

package com.devsaki.opensimpleexcel;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;

public class Multithread {

    public static final char CHAR_END = (char) -1;

    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        String excelFile = "C:/Downloads/BigSpreadsheetAllTypes.xlsx";
        ZipFile zipFile = new ZipFile(excelFile);
        long init = System.currentTimeMillis();
        ExecutorService executor = Executors.newFixedThreadPool(4);
        char[] sheet1 = readEntry(zipFile, "xl/worksheets/sheet1.xml").toCharArray();
        Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(new CharReader(sheet1), executor));
        char[] sharedString = readEntry(zipFile, "xl/sharedStrings.xml").toCharArray();
        Future<String[]> futureWords = executor.submit(() -> processSharedStrings(new CharReader(sharedString)));

        Object[][] sheet = futureSheet1.get();
        String[] words = futureWords.get();

        executor.shutdown();

        long end = System.currentTimeMillis();
        System.out.println("only read: " + (end - init) / 1000);

        ///Doing somethin with the file::Saving as csv
        init = System.currentTimeMillis();
        try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
            for (Object[] rows : sheet) {
                for (Object cell : rows) {
                    if (cell != null) {
                        if (cell instanceof Integer) {
                            writer.append(words[(Integer) cell]);
                        } else if (cell instanceof String) {
                            writer.append(toDate(Double.parseDouble(cell.toString())));
                        } else {
                            writer.append(cell.toString()); //Probably a number
                        }
                    }
                    writer.append(";");
                }
                writer.append("n");
            }
        }
        end = System.currentTimeMillis();
        System.out.println("Main saving to csv: " + (end - init) / 1000);
    }

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
    private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);

    //The number in excel is from 1900-jan-1, so every number time that you get, you have to sum to that date
    public static String toDate(double s) {
        return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
    }

    public static String readEntry(ZipFile zipFile, String entry) throws IOException {
        System.out.println("Initialing readEntry " + entry);
        long init = System.currentTimeMillis();
        String result = null;

        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            br.readLine();
            result = br.readLine();
        }

        long end = System.currentTimeMillis();
        System.out.println("readEntry '" + entry + "': " + (end - init) / 1000);
        return result;
    }


    public static String[] processSharedStrings(CharReader br) throws IOException {
        System.out.println("Initialing processSharedStrings");
        long init = System.currentTimeMillis();
        String[] words = null;
        char[] wordCount = "Count="".toCharArray();
        char[] token = "<t>".toCharArray();
        String uniqueCount = extractNextValue(br, wordCount, '"');
        words = new String[Integer.parseInt(uniqueCount)];
        String nextWord;
        int currentIndex = 0;
        while ((nextWord = extractNextValue(br, token, '<')) != null) {
            words[currentIndex++] = nextWord;
            br.skip(11); //you can skip at least 11 chars "/t></si><si>"
        }
        long end = System.currentTimeMillis();
        System.out.println("SharedStrings: " + (end - init) / 1000);
        return words;
    }


    public static Object[][] processSheet1(CharReader br, ExecutorService executorService) throws IOException, ExecutionException, InterruptedException {
        System.out.println("Initialing processSheet1");
        long init = System.currentTimeMillis();
        char[] dimensionToken = "dimension ref="".toCharArray();
        String dimension = extractNextValue(br, dimensionToken, '"');
        int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
        br.skip(30); //Between dimension and next tag c exists more or less 30 chars
        Object[][] result = new Object[sizes[0]][sizes[1]];

        int parallelProcess = 8;
        int currentIndex = br.currentIndex;
        CharReader[] charReaders = new CharReader[parallelProcess];
        int totalChars = Math.round(br.chars.length / parallelProcess);
        for (int i = 0; i < parallelProcess; i++) {
            int endIndex = currentIndex + totalChars;
            charReaders[i] = new CharReader(br.chars, currentIndex, endIndex, i);
            currentIndex = endIndex;
        }
        Future[] futures = new Future[parallelProcess];
        for (int i = charReaders.length - 1; i >= 0; i--) {
            final int j = i;
            futures[i] = executorService.submit(() -> inParallelProcess(charReaders[j], j == 0 ? null : charReaders[j - 1], result));
        }
        for (Future future : futures) {
            future.get();
        }

        long end = System.currentTimeMillis();
        System.out.println("Sheet1: " + (end - init) / 1000);
        return result;
    }

    public static void inParallelProcess(CharReader br, CharReader back, Object[][] result) {
        System.out.println("Initialing inParallelProcess : " + br.identifier);

        char[] tokenOpenC = "<c r="".toCharArray();
        char[] tokenOpenV = "<v>".toCharArray();

        char[] tokenAttributS = " s="".toCharArray();
        char[] tokenAttributT = " t="".toCharArray();

        String v;
        int firstCurrentIndex = br.currentIndex;
        boolean first = true;

        while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
            if (first && back != null) {
                int sum = br.currentIndex - firstCurrentIndex - tokenOpenC.length - v.length() - 1;
                first = false;
                System.out.println("Adding to : " + back.identifier + " From : " + br.identifier);
                back.plusLength(sum);
            }
            int[] indexes = extractSizeFromDimention(v);

            int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
            char type = 's'; //3 types: number (n), string (s) and date (d)
            if (s == 0) { // Token S = number or date
                char read = br.read();
                if (read == '1') {
                    type = 'n';
                } else {
                    type = 'd';
                }
            } else if (s == -1) {
                type = 'n';
            }
            String c = extractNextValue(br, tokenOpenV, '<');
            Object value = null;
            switch (type) {
                case 'n':
                    value = Double.parseDouble(c);
                    break;
                case 's':
                    try {
                        value = Integer.parseInt(c);
                    } catch (Exception ex) {
                        System.out.println("Identifier Error : " + br.identifier);
                    }
                    break;
                case 'd':
                    value = c.toString();
                    break;
            }
            result[indexes[0] - 1][indexes[1] - 1] = value;
            br.skip(7); ///v></c>
        }
    }

    static class CharReader {
        char[] chars;
        int currentIndex;
        int length;

        int identifier;

        public CharReader(char[] chars) {
            this.chars = chars;
            this.length = chars.length;
        }

        public CharReader(char[] chars, int currentIndex, int length, int identifier) {
            this.chars = chars;
            this.currentIndex = currentIndex;
            if (length > chars.length) {
                this.length = chars.length;
            } else {
                this.length = length;
            }
            this.identifier = identifier;
        }

        public void plusLength(int n) {
            if (this.length + n <= chars.length) {
                this.length += n;
            }
        }

        public char read() {
            if (currentIndex >= length) {
                return CHAR_END;
            }
            return chars[currentIndex++];
        }

        public void skip(int n) {
            currentIndex += n;
        }
    }


    public static int[] extractSizeFromDimention(String dimention) {
        StringBuilder sb = new StringBuilder();
        int columns = 0;
        int rows = 0;
        for (char c : dimention.toCharArray()) {
            if (columns == 0) {
                if (Character.isDigit(c)) {
                    columns = convertExcelIndex(sb.toString());
                    sb = new StringBuilder();
                }
            }
            sb.append(c);
        }
        rows = Integer.parseInt(sb.toString());
        return new int[]{rows, columns};
    }

    public static int foundNextTokens(CharReader br, char until, char[]... tokens) {
        char character;
        int[] indexes = new int[tokens.length];
        while ((character = br.read()) != CHAR_END) {
            if (character == until) {
                break;
            }
            for (int i = 0; i < indexes.length; i++) {
                if (tokens[i][indexes[i]] == character) {
                    indexes[i]++;
                    if (indexes[i] == tokens[i].length) {
                        return i;
                    }
                } else {
                    indexes[i] = 0;
                }
            }
        }

        return -1;
    }

    public static String extractNextValue(CharReader br, char[] token, char until) {
        char character;
        StringBuilder sb = new StringBuilder();
        int index = 0;

        while ((character = br.read()) != CHAR_END) {
            if (index == token.length) {
                if (character == until) {
                    return sb.toString();
                } else {
                    sb.append(character);
                }
            } else {
                if (token[index] == character) {
                    index++;
                } else {
                    index = 0;
                }
            }
        }
        return null;
    }

    public static int convertExcelIndex(String index) {
        int result = 0;
        for (char c : index.toCharArray()) {
            result = result * 26 + ((int) c - (int) 'A' + 1);
        }
        return result;
    }
}

Old answer (Not need the parameter Xms7g, so take less memory):
It takes to open and read the example file about 35 seconds (200MB) with an HDD, with SDD takes a little less (30 seconds).

Here the code:
https://github.com/csaki/OpenSimpleExcelFast.git

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;

public class Launcher {

    public static final char CHAR_END = (char) -1;

    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        long init = System.currentTimeMillis();
        String excelFile = "D:/Downloads/BigSpreadsheet.xlsx";
        ZipFile zipFile = new ZipFile(excelFile);

        ExecutorService executor = Executors.newFixedThreadPool(4);
        Future<String[]> futureWords = executor.submit(() -> processSharedStrings(zipFile));
        Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(zipFile));
        String[] words = futureWords.get();
        Object[][] sheet1 = futureSheet1.get();
        executor.shutdown();

        long end = System.currentTimeMillis();
        System.out.println("Main only open and read: " + (end - init) / 1000);


        ///Doing somethin with the file::Saving as csv
        init = System.currentTimeMillis();
        try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
            for (Object[] rows : sheet1) {
                for (Object cell : rows) {
                    if (cell != null) {
                        if (cell instanceof Integer) {
                            writer.append(words[(Integer) cell]);
                        } else if (cell instanceof String) {
                            writer.append(toDate(Double.parseDouble(cell.toString())));
                        } else {
                            writer.append(cell.toString()); //Probably a number
                        }
                    }
                    writer.append(";");
                }
                writer.append("n");
            }
        }
        end = System.currentTimeMillis();
        System.out.println("Main saving to csv: " + (end - init) / 1000);
    }

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
    private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);

    //The number in excel is from 1900-jan-1, so every number time that you get, you have to sum to that date
    public static String toDate(double s) {
        return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
    }

    public static Object[][] processSheet1(ZipFile zipFile) throws IOException {
        String entry = "xl/worksheets/sheet1.xml";
        Object[][] result = null;
        char[] dimensionToken = "dimension ref="".toCharArray();
        char[] tokenOpenC = "<c r="".toCharArray();
        char[] tokenOpenV = "<v>".toCharArray();

        char[] tokenAttributS = " s="".toCharArray();
        char[] tokenAttributT = " t="".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String dimension = extractNextValue(br, dimensionToken, '"');
            int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
            br.skip(30); //Between dimension and next tag c exists more or less 30 chars
            result = new Object[sizes[0]][sizes[1]];
            String v;
            while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
                int[] indexes = extractSizeFromDimention(v);

                int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
                char type = 's'; //3 types: number (n), string (s) and date (d)
                if (s == 0) { // Token S = number or date
                    char read = (char) br.read();
                    if (read == '1') {
                        type = 'n';
                    } else {
                        type = 'd';
                    }
                } else if (s == -1) {
                    type = 'n';
                }
                String c = extractNextValue(br, tokenOpenV, '<');
                Object value = null;
                switch (type) {
                    case 'n':
                        value = Double.parseDouble(c);
                        break;
                    case 's':
                        value = Integer.parseInt(c);
                        break;
                    case 'd':
                        value = c.toString();
                        break;
                }
                result[indexes[0] - 1][indexes[1] - 1] = value;
                br.skip(7); ///v></c>
            }
        }
        return result;
    }

    public static int[] extractSizeFromDimention(String dimention) {
        StringBuilder sb = new StringBuilder();
        int columns = 0;
        int rows = 0;
        for (char c : dimention.toCharArray()) {
            if (columns == 0) {
                if (Character.isDigit(c)) {
                    columns = convertExcelIndex(sb.toString());
                    sb = new StringBuilder();
                }
            }
            sb.append(c);
        }
        rows = Integer.parseInt(sb.toString());
        return new int[]{rows, columns};
    }

    public static String[] processSharedStrings(ZipFile zipFile) throws IOException {
        String entry = "xl/sharedStrings.xml";
        String[] words = null;
        char[] wordCount = "Count="".toCharArray();
        char[] token = "<t>".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String uniqueCount = extractNextValue(br, wordCount, '"');
            words = new String[Integer.parseInt(uniqueCount)];
            String nextWord;
            int currentIndex = 0;
            while ((nextWord = extractNextValue(br, token, '<')) != null) {
                words[currentIndex++] = nextWord;
                br.skip(11); //you can skip at least 11 chars "/t></si><si>"
            }
        }
        return words;
    }

    public static int foundNextTokens(BufferedReader br, char until, char[]... tokens) throws IOException {
        char character;
        int[] indexes = new int[tokens.length];
        while ((character = (char) br.read()) != CHAR_END) {
            if (character == until) {
                break;
            }
            for (int i = 0; i < indexes.length; i++) {
                if (tokens[i][indexes[i]] == character) {
                    indexes[i]++;
                    if (indexes[i] == tokens[i].length) {
                        return i;
                    }
                } else {
                    indexes[i] = 0;
                }
            }
        }

        return -1;
    }

    public static String extractNextValue(BufferedReader br, char[] token, char until) throws IOException {
        char character;
        StringBuilder sb = new StringBuilder();
        int index = 0;

        while ((character = (char) br.read()) != CHAR_END) {
            if (index == token.length) {
                if (character == until) {
                    return sb.toString();
                } else {
                    sb.append(character);
                }
            } else {
                if (token[index] == character) {
                    index++;
                } else {
                    index = 0;
                }
            }
        }
        return null;
    }

    public static int convertExcelIndex(String index) {
        int result = 0;
        for (char c : index.toCharArray()) {
            result = result * 26 + ((int) c - (int) 'A' + 1);
        }
        return result;
    }

}
Answered By: devsaki

Looks like it is hardly achievable in Python at all. If we unpack a sheet data file then it would take all required 30 seconds just to pass it through the C-based iterative SAX parser (using lxml, a very fast wrapper over libxml2):

from __future__ import print_function

from lxml import etree
import time


start_ts = time.time()

for data in etree.iterparse(open('xl/worksheets/sheet1.xml'), events=('start',), 
                            collect_ids=False, resolve_entities=False,
                            huge_tree=True):
    pass

print(time.time() - start_ts)

The sample output: 27.2134890556

By the way, the Excel itself needs about 40 seconds to load the workbook.

Answered By: void

I’m using a Dell Precision T1700 workstation and using c# I was able to open the file and read it’s contents in about 24 seconds just using standard code to open a workbook using interop services. Using references to the Microsoft Excel 15.0 Object Library here is my code.

My using statements:

using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;

Code to open and read workbook:

public partial class MainWindow : Window {
    public MainWindow() {
        InitializeComponent();

        Excel.Application xlApp;
        Excel.Workbook wb;
        Excel.Worksheet ws;

        xlApp = new Excel.Application();
        xlApp.Visible = false;
        xlApp.ScreenUpdating = false;

        wb = xlApp.Workbooks.Open(@"Desired Path of workbookCopy of BigSpreadsheet.xlsx");

        ws = wb.Sheets["Sheet1"];

        //string rng = ws.get_Range("A1").Value;
        MessageBox.Show(ws.get_Range("A1").Value);

        Marshal.FinalReleaseComObject(ws);

        wb.Close();
        Marshal.FinalReleaseComObject(wb);

        xlApp.Quit();
        Marshal.FinalReleaseComObject(xlApp);

        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}
Answered By: R. Roe

The c# and ole solution still have some bottleneck.So i test it by c++ and ado.

_bstr_t connStr(makeConnStr(excelFile, header).c_str());

TESTHR(pRec.CreateInstance(__uuidof(Recordset)));       
TESTHR(pRec->Open(sqlSelectSheet(connStr, sheetIndex).c_str(), connStr, adOpenStatic, adLockOptimistic, adCmdText));

while(!pRec->adoEOF)
{
    for(long i = 0; i < pRec->Fields->GetCount(); ++i)
    {   
        _variant_t v = pRec->Fields->GetItem(i)->Value;
        if(v.vt == VT_R8)
            num[i] = v.dblVal;
        if(v.vt == VT_BSTR)
            str[i] = v.bstrVal;          
        ++cellCount;
    }                                    
    pRec->MoveNext();
}

In i5-4460 and HDD machine,i find 500 thousands of cell in xls will take 1.5s.But same data in xlsx will take 2.829s.so it’s possible for manipulating your data under 30s.

If you really need under 30s,use RAM Drive to reduce file IO.It will significantly improve your process.
I cannot download your data to test it,so please tell me the result.

Answered By: obgnaw

I have created an sample Java program which is able to load the file in ~40 seconds my laptop ( Intel i7 4 core, 16 GB RAM).

https://github.com/skadyan/largefile

This program uses the Apache POI library to load the .xlsx file using the XSSF SAX API.

The callback interface com.stackoverlfow.largefile.RecordHandler implementation can be used to process the data loaded from the excel. This interface define only one method which take three arguments

  • sheetname : String, excel sheet name
  • row number: int, row number of data
  • and data map: Map: excel cell reference and excel formatted cell value

The class com.stackoverlfow.largefile.Main demonstrate one basic implementation of this interface which just print the row number on console.

Update

woodstox parser seems have better performance than standard SAXReader. (code updated in repo).

Also in order to meet the desired performance requirement, you may consider to re-implement the org.apache.poi...XSSFSheetXMLHandler. In the implementation, more optimized string/text value handling can be implemented and unnecessary text formatting operation may be skipped.

Answered By: skadya

I would like to have more info about the system where you
are opening the file… anyway:

look in your system for a Windows update called

“Office File Validation Add-In for Office …”

if you have it… uninstall it…
the file should load much more quickly
specially if is loaded froma share

Answered By: ZEE

Another way that should improve largely the load/operation time is a RAMDrive

create a RAMDrive with enough space for your file and a 10%..20% extra space…
copy the file for the RAMDrive…
Load the file from there… depending on your drive and filesystem
the speed improvement should be huge…

My favourite is IMDisk toolkit

(https://sourceforge.net/projects/imdisk-toolkit/)
here you have a powerfull command line to script everything…

I also recommend SoftPerfect ramdisk
(http://www.majorgeeks.com/files/details/softperfect_ram_disk.html)

but that also depends of your OS…

Answered By: ZEE
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.