How do I begin analyzing data using Python?

Answer by William Chen:

Check out Harvard’s free data science course.

The homeworks (with solutions) walk you through a number of data analysis, mining, scraping, and manipulation problems with Python and the IPython Notebook!

Check out Coursera’s free data science course.

Link: Coursera

To specifically play with data science and python, check out their Twitter Sentiment Analysis in Python assignment.
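To give a feel for what a sentiment exercise involves, here is a minimal sketch of lexicon-based scoring. It is not the assignment's solution: the word list, the scores, and the sample tweets are all made up for illustration, and a real exercise would load a published sentiment lexicon and real tweets instead.

```python
import re

# Hypothetical mini-lexicon (word -> sentiment score), invented for this sketch.
LEXICON = {"love": 3, "great": 3, "good": 2, "bad": -2, "hate": -3, "awful": -3}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text, lexicon=LEXICON):
    """Sum the scores of known words; words missing from the lexicon count as 0."""
    return sum(lexicon.get(word, 0) for word in tokenize(text))


# Made-up sample tweets.
tweets = [
    "I love Python, it is great!",
    "Bad traffic today, I hate commuting",
    "Just landed in Boston",
]

scores = [sentiment_score(tweet) for tweet in tweets]
for tweet, score in zip(tweets, scores):
    print(score, tweet)
```

A positive total suggests a positive tweet, a negative total a negative one, and zero means no lexicon word was found.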

Check out my more comprehensive answer at William Chen’s answer to How do I become a data scientist?

I curate material on learning data science at Learn Data Science. Follow that blog or my blog at Storytelling with Statistics to get updates on new content!

Job at Google in 6 months recipe


TL;DR Read three books, watch two classes, solve problems at topcoder/codeforces/projecteuler. Links are below.

If you are interested in getting a job at Google, you’ve probably already read Steve Yegge’s famous post (if you haven’t, read it first). It assumes you have only a couple of weeks for preparation and focuses on topics. I used it as a checklist. This recipe focuses on resources instead; depending on your background, it requires 3-6 months and gives you a pretty good chance of passing the interview even if you never knew the difference between a graph and a hash table.

After following the recipe you will

  • have basic math, algorithm and data structure knowledge required for the interview
  • change the way you think about problems, expressing them using mathematical models (it is not scary, rather fun)
  • possibly love learning. Here at Google it is a continuous process and no less intensive.
  • learn nothing about system design (see below)

THAW from Tangible Media Group on Vimeo.

THAW is a novel interaction system that allows a collocated large display and small handheld devices to seamlessly work together. The smartphone acts both as a physical interface and as an additional graphics layer for near-surface interaction on a computer screen. Our system enables accurate position tracking of a smartphone placed on or over any screen by displaying a 2D color pattern that is captured using the smartphone’s back-facing camera. The proposed technique can be implemented on existing devices without the need for additional hardware.
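The idea of recovering position from a displayed color pattern can be sketched in a few lines. This is only an illustration under a strong assumption: the pattern here is a plain gradient in which the red channel encodes x and the green channel encodes y. The pattern THAW actually uses, and its handling of camera noise and color calibration, may well differ; the function names are hypothetical.

```python
def pattern_color(x, y, width, height):
    """Color the large display would show at pixel (x, y) in this assumed pattern:
    red grows with x, green grows with y, blue is unused."""
    r = round(255 * x / (width - 1))
    g = round(255 * y / (height - 1))
    return (r, g, 0)


def decode_position(color, width, height):
    """Invert the gradient: estimate (x, y) from a color seen by the phone camera."""
    r, g, _ = color
    x = round(r / 255 * (width - 1))
    y = round(g / 255 * (height - 1))
    return (x, y)


# On a 256x256 pattern each 8-bit channel value maps to exactly one pixel.
small = decode_position(pattern_color(100, 200, 256, 256), 256, 256)

# On a 1920x1080 screen, 8-bit color quantization limits precision to a few pixels.
large = decode_position(pattern_color(960, 540, 1920, 1080), 1920, 1080)

print(small, large)
```

The second case shows why a naive gradient is only approximate on a full-size screen: 256 color levels have to cover many more pixels.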

When companies such as Facebook, Google, YouTube, Twitter and Quora started off, were the founders aware from day one that they may change…

Answer by Dustin Moskovitz:

Of course not. It took at least 6 or 7.

Seriously though, Facebook was such a phenomenon *right away* at Harvard that 80% of the students were using it within the first week. It was very easy to see that there was nothing particularly special about Harvard that meant real-identity social networking would be successful there and not other places. We weren’t completely confident that we would be the ones to replicate the model, but we were absolutely certain that some product like Facebook would become extremely popular and change the world in all the ways it has.

Machine Learning with Scikit-Learn - Jake Vanderplas from PyData on Vimeo.

Scikit-learn is a popular Python machine learning library. In this tutorial, I’ll give an introduction to the core concepts of machine learning, using scikit-learn to demonstrate applications of these concepts on real-world datasets. We’ll cover some of the most powerful and popular supervised and unsupervised learning techniques, including classification and regression models like Support Vector Machines and Random Forests, clustering models like K Means and Gaussian Mixtures, and dimensionality reduction models like PCA and manifold learning. Throughout, I’ll emphasize the key features of the scikit-learn API, so that participants will be well-poised to begin exploring their own datasets using the wide array of algorithms implemented in scikit-learn.
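As a rough illustration of the API pattern the abstract refers to, the sketch below runs one supervised model, one clustering model, and one dimensionality reduction on the bundled iris dataset. It is not taken from the tutorial itself; it assumes a reasonably recent scikit-learn (one that provides sklearn.model_selection), and the dataset and parameter values are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Supervised learning: fit on labeled training data, score on held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Unsupervised learning: same estimator interface, but no labels are passed.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Dimensionality reduction: project the 4 features down to 2 components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("test accuracy:", accuracy)
print("clusters found:", len(set(cluster_labels)))
print("reduced shape:", X_2d.shape)
```

The point of the sketch is the shared interface: every estimator is constructed with its parameters and then driven through fit, plus predict, fit_predict, or fit_transform as appropriate.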

Jake Vanderplas

Jake Vanderplas is an NSF Postdoctoral fellow working jointly in the Computer Science and Astronomy departments at the University of Washington. His research involves large-scale machine learning applications within astronomy and astrophysics. He is a maintainer of the Python packages Scikit-learn and Scipy, and regularly contributes to several of the other packages within the numpy/scipy ecosystem. He occasionally blogs about Python-related topics at Pythonic Perambulations.

What is PyData?

PyData is the home for all things related to the use of Python in data management and analysis. This site aims to make open source data science tools easily accessible by listing the links in one location. If you would like to submit a download link or any items to be listed in PyData News, please let us know at:

PyData conferences are a gathering of users and developers of data analysis tools in Python. The goals are to provide Python enthusiasts a place to share ideas and learn from each other about how best to apply the language and tools to ever-evolving challenges in the vast realm of data management, processing, analytics, and visualization.

We aim to be an accessible, community-driven conference, with tutorials for novices, advanced topical workshops for practitioners, and opportunities for package developers and users to meet in person.

A major goal of PyData events and conferences is to provide a venue for users across all the various domains of data analysis to share their experiences and their techniques, as well as highlight the triumphs and potential pitfalls of using Python for certain kinds of problems.

PyData is organized by NumFOCUS with the generous help and support of our sponsors. Proceeds from PyData are donated to NumFOCUS and used for the continued development of the open-source tools used by data scientists. If you would like to volunteer to be a part of the PyData team, contact us at:

A tutorial on metric learning with some recent advances from SF Bay Area Machine Learning on Vimeo.

Nakul Verma (Janelia Farm Research Campus, HHMI)


A tutorial on metric learning with some recent advances.

The goal of metric learning is to learn a notion of distance in the representation space that yields good prediction performance on data. In this tutorial we explore some classic ways one can efficiently find good metrics. Starting from the basics, we’ll cover classic techniques like Large Margin Nearest Neighbor (LMNN) and Information Theoretic Metric Learning (ITML) and discuss the key principles that make these techniques effective. We will also study some extensions and see how metric learning has helped in ranking problems (information retrieval) and large scale classification.
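To make "a notion of distance" concrete: many metric learning methods work with distances of the form d(x, y) = ||L(x - y)||, where L is a linear map. The sketch below shows only how such a map changes which point counts as the nearest neighbor. The matrix here is hand-picked for illustration, not learned; methods such as LMNN or ITML would fit it from labeled data. The points and function names are made up.

```python
import math


def apply_linear_map(L, v):
    """Multiply matrix L (a list of rows) by vector v."""
    return [sum(l_ij * v_j for l_ij, v_j in zip(row, v)) for row in L]


def mapped_distance(L, x, y):
    """Distance under the linear map L: the Euclidean norm of L(x - y)."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    return math.sqrt(sum(c * c for c in apply_linear_map(L, diff)))


def nearest(L, query, points):
    """Name of the point closest to the query under the given map."""
    return min(points, key=lambda name: mapped_distance(L, query, points[name]))


# Feature 0 is informative; feature 1 is assumed to be mostly noise.
points = {"a": [0.5, 10.0], "b": [3.0, 0.0]}
query = [0.0, 0.0]

identity = [[1.0, 0.0], [0.0, 1.0]]   # plain Euclidean distance
shrink_noise = [[1.0, 0.0], [0.0, 0.1]]  # hand-picked map that down-weights feature 1

print(nearest(identity, query, points))      # neighbor under Euclidean distance
print(nearest(shrink_noise, query, points))  # neighbor once feature 1 is down-weighted
```

Under the identity map the noisy second feature dominates and "b" is nearest; after down-weighting that feature, "a" becomes nearest. Learning the map amounts to choosing such weights automatically so that neighbors tend to share labels.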

Speaker Bio:
Dr. Nakul Verma is a research specialist at Janelia Farm Research Campus, a center for conducting fundamental research in basic sciences, where he is developing novel statistical techniques to help biologists quantitatively analyze behavioral phenotypes in model organisms and better understand the underlying neuroscience and genetic principles. His interests include high dimensional data analysis and exploiting intrinsic structure in data to design effective learning algorithms. Previously Dr. Verma worked at Amazon as a research scientist developing risk assessment models for real-time fraud detection. Dr. Verma received his PhD in Computer Science from UC San Diego specializing in Machine Learning.

-Flurry for hosting
-Tommy Chheng for recording

Neural Networks for Machine Perception from SF Bay Area Machine Learning on Vimeo.

Main Talk: Neural Networks for Machine Perception
Speaker: Ilya Sutskever (Google)
Neural Networks are computational learning models that are loosely based on real neurons. They can learn to perform various tasks by iteratively adjusting their connections. Recently, Neural Networks have enjoyed considerable success in speech recognition and visual object recognition. In this introductory talk, I will explain how neural networks learn and why they succeed, then describe how they’ve been used to achieve true state-of-the-art results on speech and visual object recognition.
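The phrase "iteratively adjusting their connections" can be illustrated with the smallest possible case: a single artificial neuron trained with the classic perceptron rule to compute logical OR. This is a toy sketch, not the multi-layer networks or training methods the talk covers; the learning rate, epoch count, and function names are arbitrary choices.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs plus the bias is positive, else 0."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else 0


def train(samples, epochs=20, learning_rate=0.1):
    """Perceptron rule: nudge each connection weight in proportion to the error."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Truth table for logical OR: (inputs, expected output).
OR_SAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

weights, bias = train(OR_SAMPLES)
predictions = [predict(weights, bias, inputs) for inputs, _ in OR_SAMPLES]
print(weights, bias, predictions)
```

Each wrong answer changes the weights slightly, and after a few passes over the data the neuron reproduces the OR table. Larger networks stack many such units and use gradient-based updates, but the loop of predict, measure the error, and adjust is the same.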

Lightning Talk: Data Science at Flurry
Speaker: Soups Ranjan (Flurry)
Flurry provides mobile app analytics and mobile advertising products to app developers. In this talk I will provide insights into how our Data Science team applies machine learning to a variety of problems, including Ad Revenue Optimization, Real-Time Bidding (RTB) strategy to purchase ad inventory programmatically, and Recommender Systems.

Special thanks:
-Flurry for hosting!
-CayMay Education for recording the event.

Data Workflows for Machine Learning from SF Bay Area Machine Learning on Vimeo.

Speaker: Paco Nathan


We compare/contrast several open source frameworks which have emerged for Machine Learning workflows, including KNIME, IPython Notebook and related Py libraries, Cascading, Cascalog, Scalding, Summingbird, Spark/MLbase, MBrace on .NET, etc. The analysis develops several points for “best of breed” and what features would be great to see across the board for many frameworks… leading up to a “scorecard” to help evaluate different alternatives. We also review the PMML standard for migrating predictive models, e.g., from SAS to Hadoop.

Speaker bio:

Paco Nathan is a “player/coach” who’s led innovative Data teams building large-scale apps for 10+ years, and worked as an OSS evangelist for the past 2+ years. Expert in distributed systems, machine learning, cloud computing, and functional programming, with a focus on Enterprise data workflows. Paco is an O’Reilly author, and an advisor for several firms including The Data Guild and Zettacap. Paco received his BS Math Sci and MS Comp Sci degrees from Stanford University, and has 30+ years technology industry experience ranging from Bell Labs to early-stage start-ups.


Special thanks to The Climate Corporation for hosting the event, and Tommy Chheng for recording.

Mathias Brandewinder - F# and Machine Learning: a winning combination from NDC Conferences on Vimeo.

While Machine Learning practitioners routinely use a wide range of tools and languages, C# is conspicuously absent from that arsenal. Is .NET inadequate for Machine Learning? In this talk, I’ll argue that it can be a great fit, as long as you use the right language for the job, namely F#.
F# is a functional-first language, with a concise and expressive syntax that will feel familiar to data scientists used to Python or Matlab. It combines the performance and maintainability benefits of statically typed languages with the flexibility of Type Providers, a unique mechanism that enables seamless consumption of virtually any data source. And as a first-class .NET citizen, it interops smoothly with C#. So if you are interested in a language that can handle both flexible data exploration and the pressure of a real production system, come check out what F# has to offer.