Acquiring basic skills working with visualizing/analyzing large data sets

Question:

I’m looking for a way to learn to be comfortable with large data sets. I’m a university student, so everything I do is of “nice” size and complexity. I’m working on a research project with a professor this semester, and I’ve had to visualize relationships in a somewhat large (in my experience) data set: a 15 MB CSV file.

I did most of my data wrangling in Python and visualized the results using gnuplot.

Are there any accessible books or websites on the subject out there? Bonus points for using Python, more bonus points for a more “basic” visualization system than relying on gnuplot. Cairo or something, I suppose.

Looking for something that takes me from data mining, to processing, to visualization.

EDIT: I’m more looking for something that will teach me the “big ideas”. I can write the code myself, but I’m looking for techniques people use to deal with large data sets. I mean, my 15 MB is small enough that I can put everything I would ever need into memory and just start crunching. What do people do to visualize 5 GB data sets?

Asked By: Daniel Harms

Answers:

Check out Information is Beautiful. It is not a technical book, but it might give you a couple of ideas for visualising data.

Maybe also have a look at the first 3 chapters of Principles of Data Mining. It goes through some concepts of visualizing data in a data mining context, and I found some parts of it useful during university.

Hope this helps

Answered By: Marcom

If you are looking for visualization rather than data mining and analysis, The Visual Display of Quantitative Information by Edward Tufte is considered one of the best books in the field.

Answered By: ktdrv

I’d say the most basic skill is a good grounding in math and statistics. This can help you assess and pick from the variety of techniques for filtering data and reducing its volume and dimensionality while keeping its integrity. The last thing you’d want to do is make something pretty that shows patterns or relationships which aren’t really there.
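
One common way to reduce the volume of a data set without biasing it is reservoir sampling, which draws a uniform random subset in a single pass without holding the whole file in memory. Here is a minimal sketch, not a definitive implementation. The function name `reservoir_sample`, the column names `x` and `y`, and the in-memory stand-in for a large CSV file are all assumptions made for illustration.

```python
import csv
import io
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1),
            # replacing a randomly chosen existing entry.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


# Stand-in for a large file on disk (hypothetical data);
# a real run would pass an open file object to csv.DictReader instead.
fake_csv = io.StringIO("x,y\n" + "\n".join(f"{i},{i * i}" for i in range(10_000)))
subset = reservoir_sample(csv.DictReader(fake_csv), k=500, seed=0)
```

Because every row has the same chance of ending up in `subset`, a plot of the sample should show roughly the same overall shape as the full data, although rare outliers may be missed.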

Specialized math

To tackle some types of problems you’ll need to learn some math to understand how particular algorithms work and what effect they’ll have on your data. There are various algorithms for clustering data, dimensionality reduction, natural language processing, etc. You may never use many of these, depending on the type of data you wish to analyze, but there are abundant resources on the Internet (and Stack Exchange sites) should you need help.

For an introductory overview of data mining techniques, Witten’s Data Mining is good. I have the 1st edition, and it explains concepts in plain language with a bit of math thrown in. I recommend it because it provides a good overview and it’s not too expensive; as you read more into the field you’ll notice many of the books are quite expensive. The only drawback is a number of pages dedicated to using WEKA, a Java data mining package, which might not be too helpful as you’re using Python (but it is open source, so you may be able to glean some ideas from the source code). I also found Introduction to Machine Learning to provide a good overview; it is also reasonably priced, with a bit more math.

Tools

For creating visualizations of your own invention, on a single machine, I think the basics should get you started: Python, NumPy, SciPy, Matplotlib, and a good graphics library you have experience with, like PIL or Pycairo. With these you can crunch numbers, plot them on graphs, and pretty things up via custom drawing routines.
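
For a file too large to load at once, one approach is to stream the rows and aggregate them into a fixed-size grid of counts, then draw the grid instead of the individual points. The sketch below uses only the standard library and is an illustration under stated assumptions: the function name `bin_points`, the column names `x` and `y`, and the value ranges are hypothetical, and the ranges are assumed to be known ahead of time (a first pass over the file could compute them).

```python
import csv
import io


def bin_points(rows, x_range, y_range, bins=50):
    """Count (x, y) pairs into a bins-by-bins grid, reading one row at a time."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = [[0] * bins for _ in range(bins)]
    for row in rows:
        x, y = float(row["x"]), float(row["y"])
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # skip values outside the plotting window
        # Map each value to a bin index; values on the upper edge go in the last bin.
        col = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
        r = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
        grid[r][col] += 1
    return grid


# Stand-in for a large CSV file (hypothetical data); memory use depends on
# the grid size, not on the number of rows read.
fake_csv = io.StringIO(
    "x,y\n" + "\n".join(f"{i % 100},{(i * 7) % 100}" for i in range(100_000))
)
grid = bin_points(csv.DictReader(fake_csv), x_range=(0, 100), y_range=(0, 100), bins=10)
total = sum(sum(r) for r in grid)
```

The resulting `grid` is small regardless of input size, so it could then be handed to a plotting call such as Matplotlib’s `imshow` to render a density heat map.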

When you want to create moving, interactive visualizations, tools like the Java-based Processing library make this easy. There are even ways of writing Processing sketches in Python via Jython, in case you don’t want to write Java.

There are many more tools out there, should you need them, like OpenCV (computer vision, machine learning), Orange (data mining, analysis, viz), and NLTK (natural language, text analysis).

Presentation principles and techniques

Books by folks in the field like Edward Tufte and references like Information Graphics can help you get a good overview of the ways of creating visualizations and presenting them effectively.

Resources to find Viz examples

Websites like Flowing Data, Infosthetics, Visual Complexity and Information is Beautiful show recent, interesting visualizations from across the web. You can also look through the many compiled lists of visualization sites out there on the Internet. Start with these as a seed and start navigating around; I’m sure you’ll find a lot of useful sites and inspiring examples.

(This was originally going to be a comment, but grew too long)

Answered By: samplebias

I like the book Data Analysis with Open Source Tools by Janert. It is a pretty broad survey of data analysis methods, focusing on how to understand the system that produced the data, rather than on sophisticated statistical methods. One caveat: while the mathematics used isn’t especially advanced, I do think you will need to be comfortable with mathematical arguments to gain much from the book.

Answered By: Michael J. Barber