Does sqlite3 compress data?

Question:

I’ve got a 7.4 GB CSV file. After converting it to an SQLite database with a Python script, the output DB is 4.7 GB, around 60% of the original size.

The CSV has around 150,000,000 rows. Its header is:

tkey,ipaddr,healthtime,numconnections,policystatus,activityflag

And each row looks something like

261846,172.10.28.15,2012-02-03 16:15:00,22,1,1

The script uses healthtime to split the data into 192 tables.

When I first saw these numbers, I assumed I had made an error someplace. How big of a reduction in file size should I expect from the added efficiency of only writing the healthtime 192 times instead of 150,000,000 times?

EDIT: As soon as I posted this I realized the answer. I’m removing about 40% of the string, hence the 40% reduction in size.

Edit 2: Let’s calculate the difference in size between the plain text:

"261846,172.10.28.15,2012-02-03 16:15:00,22,1,1" 

And the database entry:

db(261846,'172.10.28.15',22,1,1)

First of all, we drop from 46 to 26 characters in plain text representation.

The remaining characters are:

"261846,172.10.28.15,22,1,1"

or 26 bytes. If each integer needs to be stored as a 32-bit (4-byte) value, then we have:

12 bytes (ipaddr) + 4 bytes * 4 (integer fields) = 28 bytes.

So it looks like converting to integers makes the storage slightly less efficient, and all of my gains come from reducing the number of characters stored in each row.
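The arithmetic above can be checked in a few lines of Python (a sketch of the estimate only; SQLite’s actual on-disk integer encoding is variable-length, so 4 bytes per integer is an upper bound, not the exact cost):

```python
# Raw CSV row, including the healthtime field
row_text = "261846,172.10.28.15,2012-02-03 16:15:00,22,1,1"
print(len(row_text))  # 46 characters

# The row after healthtime is moved out into the table name
remaining = "261846,172.10.28.15,22,1,1"
print(len(remaining))  # 26 characters

# Rough binary estimate: ipaddr kept as 12 bytes of text,
# plus four integer fields at 4 bytes each
estimate = 12 + 4 * 4
print(estimate)  # 28 bytes
```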

Asked By: Maus


Answers:

SQLite is not running a compression algorithm, but it will store data in a binary file instead of a text file. This means the data can be stored more efficiently: for example, a 32-bit (4-byte) number can represent 10,000,000 instead of storing it as 8 bytes of text (or more if the file is Unicode).
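You can see this typing in action from Python’s built-in sqlite3 module (a minimal sketch against a throwaway in-memory database; the table name is arbitrary):

```python
import sqlite3

# SQLite keeps a value inserted into an INTEGER column as a binary
# integer, not as the 8-character string "10000000".
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (n INTEGER)")
con.execute("INSERT INTO t VALUES (10000000)")
storage_class, = con.execute("SELECT typeof(n) FROM t").fetchone()
print(storage_class)  # -> integer
con.close()
```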

Here are more details on the SQLite Database File Format if you are interested.

Does that make sense?

Answered By: Justin Ethier

SQLite, by default, does not compress data it writes to disk; however, SQLite does have a set of “Proprietary Extensions” for that and other purposes. Look for ZIPVFS in the links below.

http://www.sqlite.org/support.html and
http://www.hwaci.com/sw/sqlite/prosupport.html

You can achieve a lot of “compression” in your data by encoding fields as integers. For example, an IPv4 address fits into a 32-bit word (4 bytes): each octet of the address can be represented in one byte of the word.

octets = '172.168.0.1'.split('.')
ip = int(octets[0]) << 24
ip |= int(octets[1]) << 16
ip |= int(octets[2]) << 8
ip |= int(octets[3])
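The same packing, and the reverse conversion, can be done with Python’s standard library (a sketch using the stdlib socket and struct modules; the sample address is the one from the question):

```python
import socket
import struct

def ip_to_int(addr):
    """Pack a dotted-quad IPv4 address into one 32-bit integer."""
    return struct.unpack('!I', socket.inet_aton(addr))[0]

def int_to_ip(n):
    """Unpack a 32-bit integer back into dotted-quad form."""
    return socket.inet_ntoa(struct.pack('!I', n))

packed = ip_to_int('172.10.28.15')
print(packed)             # a single integer instead of 12 bytes of text
print(int_to_ip(packed))  # -> 172.10.28.15
```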

Additionally, your timestamp may be represented in Unix time, which is the number of seconds since the epoch.

UPDATE mytable SET healthtime = CAST(strftime('%s',healthtime) AS INTEGER);

See the Date and Time functions
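That conversion can be tried out from Python against a throwaway in-memory table (the table and column names mirror the question; the UPDATE is the statement shown above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (healthtime)")
con.execute("INSERT INTO mytable VALUES ('2012-02-03 16:15:00')")

# Rewrite the text timestamp as Unix seconds (SQLite treats the
# input as UTC here).
con.execute("UPDATE mytable SET healthtime = "
            "CAST(strftime('%s', healthtime) AS INTEGER)")

value, kind = con.execute(
    "SELECT healthtime, typeof(healthtime) FROM mytable").fetchone()
print(value, kind)  # -> 1328285700 integer
con.close()
```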

Note the CAST directive in the above SQL: SQLite does not enforce type on a column, so you may end up with a group of digits stored as a string, increasing your field size more than necessary (and making certain queries behave oddly).

One more thing: field size is not the only part of the story. Remember that indexes take up space too, and indexes on integer columns are more efficient, in terms of both disk size and performance.

Answered By: Arik

I have a rather big SQLite file in my Git repository and was wondering why the total repository size was not growing much; in fact, it was even smaller than the SQLite .db file itself. It turns out Git compresses the repository by default. A quick check on my .db file confirmed this: zipping it produced an archive only ~20% the size of the original.

So, at least with the default settings, SQLite does not appear to store its data in a compressed form. Regardless of that, it may still be fine to add a big SQLite file to a Git repository, since Git performs compression automatically.

Answered By: j00hi