Eli Sander

PyData Global 2021 slides

Thu, 28 Oct 2021 00:00:00 +0000

I gave a talk today at PyData Global about how Git works under the hood. If you want to think through how to build your own version control system, and learn more about how Git stores things and why, check out my slides here!

Think Like Git

ODSC West 2020 slides

Wed, 28 Oct 2020 00:00:00 +0000

Earlier today, I gave a talk on Business Skills for Data Science at ODSC West (virtual). There are lots of talks about technical skills for data scientists (I’ve given them myself), but I think business skills are equally important in doing impactful work. Check out my slides here:

Business Skills for Data Scientists

PyCon 2019 slides

Tue, 07 May 2019 00:00:00 +0000

I went to my third PyCon last weekend, and gave a talk about pre-mortems and post-mortems, and how teams can use them to learn from failure. As promised in that talk, I’m making my slides available for anyone who wants to reference them. Those slides are here:

Lowering the Stakes of Failure with Pre-mortems and Post-mortems

There are a couple of bonus slides at the end, which didn’t make it into my presentation. Two years out of grad school, I can’t quite keep myself from making appendix slides.

You can watch the talk itself on YouTube:

You might have noticed that I haven’t posted anything in… quite a while. It was much easier to find time to write technical blog posts on a grad student schedule! I have done some writing for the Civis blog, though:

Hacktoberfest At Civis

Civis R&D Bookshelf: Learning A New Language

Hacktoberfest 2017 Recap: joblib, conda-forge, python-lambda, and more

PyCon 2018: A Roundup of Our Favorite Talks

Civis R&D Bookshelf: Open source, Python integers, and applied predictive modeling

Civis R&D Bookshelf: Project Management Edition

I also gave a talk about scikit-learn at PyCon 2018. I never posted the slides or YouTube link on my site, but better late than never!

Software Library APIs: Lessons Learned from scikit-learn

Basic Tools for Tuning Heuristic Optimizers

Tue, 18 Oct 2016 00:00:00 +0000

Note: This post is a follow-up to another post I wrote, which is a more general introduction to heuristic optimization algorithms. I recommend you read my earlier post before this one.

I wrote this post with helpful input from the awesome ladies at a Chicago Write/Speak/Code meetup! Kara Carrell wrote a great post on The Four A’s, and I’ll link to what others wrote as it comes online.

Let’s say you have an optimization problem you want to solve. You can calculate how good a given solution is (you have a fitness function), but you don’t know anything about the fitness landscape. No problem; you can use a heuristic optimization algorithm to find a solution. Maybe you’ve already chosen an algorithm to work with (say, simulated annealing). Now you’re good to go, right? Time to run it and get the solution you need?

Well, not exactly. Turns out, if you actually want to use heuristic optimizers, you have a lot of decisions to make, and a lot of things to tune. Here are the ones you’ll run into over and over:

Stop conditions (number of steps/run time)
Fixed parameters or parameter sets (mutation rate for genetic algorithms, starting/ending temperatures for simulated annealing, etc)
Functions (mutation, recombination, selection)

To ground this with an example, here’s what you need to tune for a fairly minimal simulated annealing algorithm:

Number of steps
Starting temperature
Ending temperature
Cooling schedule (the function used to change the temperature as the algorithm runs)
Mutation algorithm

More sophisticated algorithms can easily require more tuning. And even for these five elements, it’s not necessarily obvious how to go about choosing these things.
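To make those five knobs concrete, here is a minimal sketch of a simulated annealing loop in Python. It is not taken from any particular library: the geometric cooling schedule, the Gaussian mutation, the toy fitness function, and every name in it are assumptions chosen for illustration, and it maximizes fitness.

```python
import math
import random


def geometric_cooling(t_start, t_end, frac):
    """Cooling schedule: move geometrically from t_start to t_end as frac goes 0 -> 1."""
    return t_start * (t_end / t_start) ** frac


def gaussian_mutation(x, rng):
    """Mutation: nudge the current solution by a small random amount (illustrative)."""
    return x + rng.gauss(0.0, 0.5)


def simulated_annealing(fitness, x0, n_steps, t_start, t_end,
                        cooling=geometric_cooling, mutate=gaussian_mutation,
                        seed=0):
    """Maximize `fitness`; return the best solution seen and its fitness."""
    rng = random.Random(seed)
    current, current_fit = x0, fitness(x0)
    best, best_fit = current, current_fit
    for step in range(n_steps):
        temp = cooling(t_start, t_end, step / n_steps)
        candidate = mutate(current, rng)
        candidate_fit = fitness(candidate)
        delta = candidate_fit - current_fit
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_fit = candidate, candidate_fit
            if current_fit > best_fit:
                best, best_fit = current, current_fit
    return best, best_fit


def toy_fitness(x):
    """Made-up bumpy 1-D landscape, used only to exercise the loop."""
    return -(x - 2.0) ** 2 + math.cos(3.0 * x)


best, best_fit = simulated_annealing(
    toy_fitness, x0=-5.0, n_steps=5000, t_start=10.0, t_end=0.01
)
print(best, best_fit)
```

Each argument to `simulated_annealing` corresponds to one item in the list above, which is why they are all passed in rather than hard-coded.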

To make matters worse, it turns out that tuning is really important to finding good solutions. A good mutation function can be the difference between finding the global optimum and getting stuck in a completely inferior region of the solution space. Other tuning choices can also have a huge impact on the solutions you find. But unless someone has happened to study your exact problem before¹, you probably don’t have any a priori way to make these decisions.

But there is hope! As it turns out, you can solve these problems using a few basic strategies: grid search, random sweeps, and convergence plots.

Grid Search

Let’s use the simulated annealing example, and try to tune just two things: the starting and ending temperatures. To run a grid search, simply choose a few possible values for each parameter, then try each combination. Even if you have a continuous range of possible parameter values, you can sample a wide range to get a sense of what is most promising. If you want to tune your optimizer further, you can run a grid search, find the most successful parameter combination, sample more points around the parameter set you’ve chosen, and repeat.

Grid search works well for parameters, but you can also use it for functions. Choose a few candidate mutation functions, and add those to the grid. In particular, it’s good to try mutation functions that work in very different ways; I have often been surprised at which mutation function works best for a given problem.

The main pitfall of this strategy is time. Grid search is combinatorially explosive! Each parameter you’re tuning adds an additional dimension to the grid, so the number of combinations you need to test can blow up quickly. If you’re tuning many parameters at once, make sure not to look at too many possible values per parameter. If I’m tuning a lot of things, I try to stick to 3 options per parameter.
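As a rough sketch, a grid search over two temperatures and a choice of mutation function might look like the following. The `toy_run` function is a stand-in that returns a made-up score so the example runs on its own; in practice it would launch your optimizer and return the fitness of the best solution found. All names and values here are hypothetical.

```python
import itertools
import math


def grid_search(run, grid):
    """Score every combination in `grid` (a dict mapping name -> list of options)."""
    names = list(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((run(**params), params))
    # Highest fitness first.
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


def toy_run(t_start, t_end, mutation):
    """Stand-in for one optimizer run; the score is invented, not a real result."""
    step_size = {"small_step": 0.1, "big_step": 1.0}[mutation]
    return (-abs(math.log10(t_start) - 1.0)
            - abs(math.log10(t_end) + 2.0)
            - abs(step_size - 1.0))


grid = {
    "t_start": [1.0, 10.0, 100.0],
    "t_end": [0.001, 0.01, 0.1],
    # Options can be any Python objects, including the mutation functions themselves.
    "mutation": ["small_step", "big_step"],
}

results = grid_search(toy_run, grid)
best_score, best_params = results[0]
print(len(results), best_params)
```

With three options for each temperature and two mutation functions, this already means 18 runs; adding one more three-option parameter would triple that.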

Random Sweeps

Grid search is the strategy I’ve used in the past, but after reading Ahmed El Deeb’s post on parameter sweeps, I’ve come to think that random sweeps are a better way. I highly recommend you read Ahmed’s clear and concise post on the subject, but in essence, random search lets you explore more values for each parameter, which is an especially big improvement over grid search if you have parameters that differ in importance (which is often true, even if you don’t know which are important ahead of time).

The main reason I’m still mentioning grid search is that it’s your only option for tuning functions. In general, I’d recommend a mixed approach: use a grid search for functions (including cooling schedules) in combination with a randomized approach for individual parameters. Note that parameters and functions are not necessarily independent of each other, in terms of how they affect how well the algorithm runs. If you have the computational resources, it’s worth it to tune all parameters/functions at once, rather than varying one thing at a time.
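Here is one way the mixed approach could be sketched: a grid over the function choices, with numeric parameters drawn at random for each run. Sampling on a log scale is my assumption here (a common choice for scale-like parameters such as temperature), and `toy_run` is again a made-up stand-in for a real optimizer run.

```python
import itertools
import math
import random


def log_uniform(rng, low, high):
    """Sample so that each order of magnitude between low and high is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def mixed_sweep(run, function_grid, param_ranges, n_samples, seed=0):
    """Grid over function choices; random samples for the numeric parameters."""
    rng = random.Random(seed)
    names = list(function_grid)
    results = []
    for choices in itertools.product(*(function_grid[n] for n in names)):
        functions = dict(zip(names, choices))
        for _ in range(n_samples):
            params = {name: log_uniform(rng, low, high)
                      for name, (low, high) in param_ranges.items()}
            score = run(**functions, **params)
            results.append((score, {**functions, **params}))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


def toy_run(mutation, t_start, t_end):
    """Stand-in for one optimizer run; the score is invented, not a real result."""
    step_size = {"small_step": 0.1, "big_step": 1.0}[mutation]
    return (-abs(math.log10(t_start) - 1.0)
            - abs(math.log10(t_end) + 2.0)
            - abs(step_size - 1.0))


results = mixed_sweep(
    toy_run,
    function_grid={"mutation": ["small_step", "big_step"]},
    param_ranges={"t_start": (1.0, 100.0), "t_end": (0.001, 0.1)},
    n_samples=20,
)
print(results[0])
```

Because every run draws fresh values, 20 samples per function choice explore 20 distinct starting temperatures and 20 distinct ending temperatures, rather than the handful a grid of the same size would cover.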

Convergence Plots

Grid search and random sweeps are a great way to figure out what combinations to try, but how do you know which combinations work best? This is where convergence plots come in. During every tuning run of your algorithm, save the fitness of the best solution every so often (every 1000 or 10000 steps is usually good, depending on how fast the algorithm runs). If you plot number of steps against solution fitness, you should get a plot that looks something like this:

You’ll notice that the solution usually improves quickly at first, then more slowly as it converges. The shape of the curve can vary a lot based on the algorithm and parameter set. Algorithms that focus on exploitation tend to converge quickly, while those that focus on exploration tend to converge more slowly, often converging on better solutions overall. Parameter and function choices can also affect the shape of these curves. You can use these plots to answer many different questions:

Has the algorithm converged? If the slope is still decreasing, you probably need to relax your stop condition (i.e., run the algorithm for more steps). In my experience, this is the best way to tune your stop condition. In the example plot above, you can see that the solution is still improving slowly, so it’s probably worth running a while longer.
How variable is the solution quality? That is, if you run the algorithm with the same parameter set but different random seeds, how different are the curves? Depending on your use case, consistency may be more important than finding the “best” solution.
How do different parameter sets (or different algorithms) compare? It’s possible to tune parameters based only on the final solution quality, but convergence curves provide additional information about how a parameterized algorithm is searching the space. Does a certain parameter combination help the algorithm explore more effectively? Does a different parameter combination help it converge quickly?
Is there anything wrong? If an algorithm is converging almost immediately, there’s probably something wrong with the implementation, or the mutation function you’ve chosen. Here, convergence plots can act as a simple diagnostic tool.
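Collecting the data for a convergence plot only takes a few extra lines inside the optimizer loop. The sketch below uses a deliberately simple hill climber and a made-up fitness function as stand-ins, records the best fitness at fixed intervals, and repeats the run for several seeds; the `still_improving` check and its thresholds are arbitrary assumptions, not a standard test.

```python
import random


def toy_fitness(x):
    """Made-up 1-D fitness with its maximum at x = 3."""
    return -(x - 3.0) ** 2


def hill_climb_with_history(fitness, x0, n_steps, record_every, seed):
    """Toy optimizer that logs (step, best fitness so far) at fixed intervals."""
    rng = random.Random(seed)
    best, best_fit = x0, fitness(x0)
    history = [(0, best_fit)]
    for step in range(1, n_steps + 1):
        candidate = best + rng.gauss(0.0, 0.1)
        candidate_fit = fitness(candidate)
        if candidate_fit > best_fit:
            best, best_fit = candidate, candidate_fit
        if step % record_every == 0:
            history.append((step, best_fit))
    return history


def still_improving(history, last_n=3, tolerance=1e-6):
    """Crude check: did fitness still gain over the last few checkpoints?"""
    recent = [fit for _, fit in history[-last_n:]]
    return recent[-1] - recent[0] > tolerance


# Same parameters, different random seeds, to see how variable the curves are.
histories = {
    seed: hill_climb_with_history(toy_fitness, 0.0, 5000, 500, seed)
    for seed in range(3)
}
for seed, history in histories.items():
    print(seed, history[-1], still_improving(history))
```

Each history is a list of (step, fitness) pairs, so it can be handed directly to a plotting library (for example, matplotlib's `plt.plot`) to draw one curve per seed or per parameter set.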

Conclusion

Heuristic optimization algorithms can be frustrating, especially when you’re trying to tune them. These algorithms are very flexible, which is a blessing and a curse. You can use them to solve all kinds of problems, but you pay the cost of having to tweak them quite a bit if you want something that works really well for your problem. This is the fundamental trade-off you make: few assumptions, but few guarantees. Lots of ways to customize and tune, but…. lots of ways to customize and tune.

Although it can be time-consuming, if you’re going to use an algorithm a lot, or if you really need the best solution possible, it’s worth putting some time into tuning. Fortunately, grid search, random sweeps, and convergence plots can get the job done. There are a few specific considerations to keep in mind when tuning different parameters and functions. Stay tuned (ha!) for a future post on this topic!

1 Algorithm tuning is a thing that people have studied. For very specific problems. Have they studied it for your problem? Maybe. It’s worth a look! If you can identify that your problem is fundamentally a knapsack problem, or a travelling salesman problem, or another classic well-defined problem, then a Google Scholar search can let you see what others have done, and at least get some good ideas for parameter choices and mutation functions. But even in this basic scenario, you will at least want to tune your stop conditions, since the size of your problem is likely to be different from the one studied.

My Talk at PyData Chicago 2016

Mon, 29 Aug 2016 00:00:00 +0000

Last weekend I gave a talk at PyData Chicago! It was called “Evolutionary Algorithms: Perfecting the Art of ‘Good Enough’” (props to my thesis advisor, Stefano Allesina, for the catchy title). It was heavily based on my blog post and workshops I’ve given. Heuristic optimizers are a fun topic for me because they are so general and useful, but they’re not really a hot topic, and I think a lot of people have just never encountered them. They’re a great addition to a data scientist’s toolkit.

It was my first tech talk, and a lot of fun. It was very different from giving an academic talk. It was nice to give a talk that was about getting people excited about a technique, rather than to present results of a specific study. I’m very proud of the fact that I got to talk about managing Skyrim loot as an example of an optimization problem (it’s quite literally a knapsack problem!).

You can take a look at the slides here:

Evolutionary Algorithms: Perfecting the Art of “Good Enough”

And the talk itself here:

Collaborating with your future self using Markdown documents

Fri, 03 Jun 2016 00:00:00 +0000

The pipeline for an analysis project can get complicated and confusing, especially if you’re simulating your own data. I often create pipelines with several different scripts in different languages, but it’s easy to forget a step. But a couple of months ago, I wrote myself a little Markdown file that looks something like this*:

0. Format data and subsample to build training set
**./RCode/BuildTrainingSet.R**
(for dataset B, use **./RCode/BuildTrainingSetB.R**)

1. Submit several runs of genetic algorithm to computing cluster
**./PythonCode/LaunchGA.py**

2. Check convergence
**./PythonCode/CheckConvergence.py**
(if unconverged, return to 1)

3. Calculate mutual information for best solution
**./RCode/MutualInformation.R**

4. Visualize using alluvial diagrams
**TBD**

This is simply a numbered list of what I need to do to run my entire analysis from start to finish, complete with paths to the relevant code, and small notes to myself. I used to trust myself to remember this pipeline. I mean, I came up with the pipeline, and I commented the code, and I gave my files informative names. But when I analyze data, I end up writing lots of small scripts, to explore the structure of the data and to test assumptions. By the time I need to rerun the analysis, I’ve lost track of the files I need to run, and the result involves opening and reading a bunch of files, mixed with some trial and error running them.

But why rerun the analysis at all? If you’ve ever published a scientific article, you probably know why. Reviewers always have changes to make and model variants to test. This often involves changing a few lines, then running the entire analysis over again. But even outside of academia, if you work with data, this is a problem that you’ll run into. You’ll learn something new about your dataset, or want to try a different prior or a different algorithm. Writing things down in a structured way has saved me hours during that process.

There’s another good reason to write up your process like this. Writing things down step by step forces you to think through the work that you’ve done. Why did you write that script? Does this pipeline make sense? Are there steps that you did by hand that you need to make note of, or automate? Making the process clear in your head will make it clearer when you present your work to someone else. It also provides a convenient outline for the methods section if you’re writing a scientific article.

Why a Markdown file in particular? Well, you could use HTML, or a simple text file, but I like Markdown for a couple of reasons. First, it is very simple. If you don’t know how to use it, you can read an overview and learn in about five minutes. It also plays well with code hosting sites like GitHub and Bitbucket. If you open the file on GitHub, it will render nicely. But it is also easy to read in a text editor, because the markup is less intrusive than LaTeX or HTML.

I only recently started doing these little Markdown write-ups, because I somehow believed that I would always remember what my code did. After all, I wrote it. But after a few months have passed, anyone could have written that code. In essence, I’m collaborating with past me, except that past me never responds to emails or answers my perfectly reasonable questions.

Now, rather than dealing with the crappy collaborator that is past me, I like to think about how future me will think of me as a collaborator. Future me is also kind of a pain to collaborate with. She expects me to know in advance what questions she has, and she forgets so much of what I worked on. But it’s not hard to collaborate well with your future self. You already have a record of your analysis: the code in your repository. The Markdown notes simply act as a table of contents. Write one for each of your data pipelines, and your future self will thank you.

* This example is a simplified version of a project I’m working on right now. The details aren’t important, but including this level of detail in your own write-ups will make them more useful to you.

Capturing Shell Output in R and Python

Thu, 31 Mar 2016 00:00:00 +0000

Sometimes I spend significant time in R or Python trying to do something which is trivial in bash. Reaching for the shell is especially useful when I’m working with very large files that will take a long time to read in. Why read in an entire file to get the last line, when I could just use tail -n 1? Or if I want the line count, why read it in when wc -l will get the job done faster?

It turns out that it’s not too complicated to capture shell output in R or Python. Here’s how I do it.

Python

If you use Python 3, capturing shell output is pretty simple (if you’re still on Python 2, the tides are turning! It’s time to make the change!). You can use the subprocess module to get the output in bytes, then decode and parse it.

import subprocess

## Get the last line of the file 'fname'
last_line = subprocess.check_output("tail -n 1 " + fname, shell = True)
## convert to string and parse
## 'UTF-8' is a common encoding, but you may need to use something else
last_line = last_line.decode('UTF-8').strip()
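A variation on the snippet above, as a sketch: passing the command as a list avoids building a shell string (so file names with spaces or shell metacharacters are not interpreted by the shell), and `text=True` asks `subprocess.run` to decode the output for you. This assumes Python 3.7 or later and that `tail` and `wc` are available on your system; the helper names are my own.

```python
import subprocess


def last_line(path):
    """Return the last line of `path` using `tail`, without reading the file in Python."""
    result = subprocess.run(
        ["tail", "-n", "1", path],
        capture_output=True,
        text=True,    # decode bytes to str using the default encoding
        check=True,   # raise CalledProcessError if tail exits with an error
    )
    return result.stdout.strip()


def line_count(path):
    """Return the number of newline characters in `path`, as reported by `wc -l`."""
    result = subprocess.run(
        ["wc", "-l", path],
        capture_output=True,
        text=True,
        check=True,
    )
    # wc prints the count followed by the file name; keep only the count.
    return int(result.stdout.split()[0])
```

If your file is not in the default encoding, you can pass `encoding='...'` to `subprocess.run` instead of `text=True`.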

R

R makes this process easy too. You may have used system() before to submit shell commands. It turns out that if you set the argument intern = TRUE, you’ll get the output as a character vector, so you don’t even have to deal with encoding! The output may take some parsing, but the stringr package is good for that.

require(stringr)
## Get the last line of the file 'fname'
lastLine = system(stringr::str_c("tail -n 1 ", fname), intern = TRUE)
## strip leading/trailing whitespace
lastLine = stringr::str_trim(lastLine)

This has saved me from reinventing the wheel many times since I learned it. Hopefully it helps you too!

I Published a Thing in Code Words!

Thu, 17 Mar 2016 00:00:00 +0000

The Recurse Center puts out a quarterly publication called Code Words, which publishes articles that try to capture the fun of digging into a problem and learning about programming. I wrote a piece on the grammar of graphics, and how it can provide a language for exploring and talking about data visualization. It uses examples from R’s ggplot2 package, but the ideas are more general. Check out my article, and other great articles from RC alums, in the sixth issue of Code Words!

What '.' Means in R, and Why it Matters

Thu, 10 Dec 2015 00:00:00 +0000

As far as I can tell, the R community has no generally-accepted style guide. Google and Hadley Wickham both have style guides, but across and even within CRAN packages, different naming and spacing conventions abound. You’re likely to find variables named in camelCase, snake_case, or, interestingly, dot.case. This last convention is unusual, because unlike many languages, R does not enforce specific syntactic meaning for dots. Dots can denote methods for S3 classes, but they don’t have to. This means that R only cares about dots sometimes, with confusing results.

S3 generic functions, like print, use the function UseMethod to dispatch the appropriate method for the data type. The methods are named using the construction function.class (e.g., print.lm, print.data.frame). Sometimes you can even call the methods directly, without the generic. For example:

> print('hi')
[1] "hi"
> print.data.frame('hi')
NULL
<0 rows> (or 0-length row.names)

The character object "hi" isn’t a data.frame, but R will try to call it like a normal function anyway. S3 is very unstructured in this way. The problem is that R can’t always distinguish between S3 generics/methods and dot case functions:

> example <- function(x) print(mean(x))
> example.two <- function(x) print(sum(x))
> methods(example)
[1] example.two
see '?methods' for accessing help and source code
Warning message:
In .S3methods(generic.function, class, parent.frame()) :
function 'example' appears not to be S3 generic; found functions that look like S3 methods 

Here I’ve defined two unrelated functions, but R isn’t sure what to do with them. example isn’t an S3 generic, because I don’t call UseMethod to dispatch example methods for different classes, but because of S3 naming structure, example.two looks like an S3 method on the class two. S3 is so informal that there is no checking to distinguish between the possibilities, and the methods() function lists example.two as a method (although at least it warns you that it’s not sure).

This behavior is both unexpected and problematic. If you’re relying on a specific behavior from methods(), someone else’s package (or your own functions!) could give you unreliable results. There is no reason to use dot case except as convention, and it’s not a set convention, even in base R. There are even base R functions that use a different case for the function name and arguments (colSums (x, na.rm = FALSE, dims = 1))! I’d love to see an R overhaul, where a new R version is released with a consistent style and an accompanying style guide. But until that time, R users should at least stop using dot case.

Creating and populating a database using Python and SQLalchemy. Part 2: Classes and queries

Tue, 08 Sep 2015 00:00:00 +0000

Last month I wrote a post on the SQLalchemy engine and session. Now I’m going to describe how you can set up a mapping for your schema so that you can populate and query your database.

Setting up your Schema

Setting up your schema correctly is what will allow you to get the most out of SQLalchemy. I’ll assume you’ve already set up your session and that it’s stored in the object session. After this, you’ll want to import some functions:

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, ForeignKey, Integer, String, Float, Boolean
from sqlalchemy import Index
from sqlalchemy.orm import relationship, backref

Base = declarative_base()

All I’ve done here besides importing is to set up Base. This is a basic table class provided by SQLalchemy. When you set up your own table classes, they will inherit from Base, so that many basic table methods will automatically be available to you.

Continuing with my example from the previous post, I’ll be using code snippets that I used to convert the Yelp academic dataset from JSON files to a PostgreSQL relational database. Here is what a basic table class looks like:

class Restaurant(Base):
	__tablename__ = 'restaurant'
	restaurant_id = Column(String(250), index = True, primary_key = True)
	ages_allowed = Column(String(250))
	price_range = Column(Integer)

This code is fairly readable; I’ve told SQLalchemy to call this table restaurant, and I’ve given it the names and types of a few columns. The restaurant_id column is the primary key for this table, and I’ve also created an index for the column, to make queries more efficient.

Basic Table Relationships

If your schema includes multiple tables, you will probably want to establish relationships between them. For example, I also created a table of restaurant reviews, called review, using a class much like the one above, called Review. This is a many-to-one relationship, since a restaurant may have multiple reviews, but a review will only be associated with a single restaurant. If I want to establish this relationship through the ORM, my Restaurant class will have an additional line:

class Restaurant(Base):
    __tablename__ = 'restaurant'
    restaurant_id = Column(String(250), index = True, primary_key = True)
    ages_allowed = Column(String(250))
    price_range = Column(Integer)
    ## one restaurant has many reviews; backref adds Review.restaurant
    reviews = relationship('Review', backref = 'restaurant')

This last line creates a .reviews attribute for Restaurant. The backref argument also creates a .restaurant attribute for class Review. This is syntactic sugar that allows me to set up the whole relationship in this line, without specifying the relationship in the class Review.

Many-to-many relationships are slightly more complicated. In my database, a restaurant may be associated with many categories (Cafe, Italian, Chinese Food, etc.), and each category will be associated with many restaurants. This means that I need to set up a table for categories, and a junction table that joins the restaurant and category information. I also need to establish relationships between the restaurant/category tables and the junction table, as follows:

class Restaurant(Base):
	__tablename__ = 'restaurant'
	restaurant_id = Column(String(250), index = True, primary_key = True)
	ages_allowed = Column(String(250))
	price_range = Column(Integer)
	categories = relationship('Category', secondary = 'restaurant_category')

class Category(Base):
	__tablename__  = 'category'
	category_id = Column(Integer, primary_key = True)
	restaurants = relationship('Restaurant', secondary = 'restaurant_category')
	name = Column(String(250), nullable = False)

class Restaurant_Category(Base):
    __tablename__ = 'restaurant_category'
    category_id = Column(Integer, ForeignKey('category.category_id'),
                         primary_key = True)
    restaurant_id = Column(String(250), ForeignKey('restaurant.restaurant_id'),
                           primary_key = True)

The gist of this is that I set up a separate junction table in SQLalchemy, which specifies that its columns are foreign keys, and that they form a composite primary key for the table (by setting primary_key = True for both columns). I also set up a relationship for both Restaurant and Category, telling SQLalchemy that this relationship is specified by a secondary table, in this case restaurant_category.

Populating and Querying your Database

To create these tables in your database of choice, take your engine object and run:

Base.metadata.create_all(engine)

Populating the database is pretty easy, once you’ve set up the schema. Adding a category is as simple as:

category = Category(name=item)
session.add(category)

If you want to add a restaurant and link it to the category, you can append the category to the Restaurant object:

restaurant = Restaurant(restaurant_id = 'ABC123', price_range = 1)
restaurant.categories.append(category)
session.add(restaurant)
session.commit()

Querying tables with the ORM is a powerful way to work with your database, but it takes some getting used to.

category = session.query(Category)

A basic query on a class like the one above is equivalent to the SQL SELECT * FROM category. Note that it returns a query object, so that the query can be refined over multiple lines.

category = category.filter(Category.name == 'Cafe')

This line will find all category rows with the name ‘Cafe’. Note that this can also be run as a single line:

category = session.query(Category).filter(Category.name == 'Cafe')

There are many other ways to filter and adapt your queries, many of which are listed in this tutorial. If you set up logging, you can look at the actual SQL that is being run, which can help you debug and improve your queries.
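As a minimal sketch of the logging setup, SQLAlchemy writes the statements it runs to the standard library logger named sqlalchemy.engine, so raising that logger to INFO is enough to see the SQL. Run this before issuing your queries:

```python
import logging

# Send log records to the console using the default handler.
logging.basicConfig()
# INFO shows the SQL statements; DEBUG would also show result rows.
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)
```

Passing echo=True to create_engine when you build your engine is a shortcut that has a similar effect.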

The aspect of SQLalchemy I found most confusing is figuring out how to access actual table information from the query object. There are a few approaches to this. If there are multiple rows that match your query, you can iterate over them or put them all in a list:

for row in category:
    print(row.name)

## this is equivalent to the code above,
## but stores each Category object in a list
cats = category.all()
for cat in cats:
    print(cat.name)

If you only want a single row from your query, you can use category.first() to get the first match. It’s important to note that the query object is giving you rows in the form of a Category object. We can use these objects to take advantage of all of the relationships we set up before:

mycategory = category.first()
## let's get all of the restaurants associated with this category:
category_rest = mycategory.restaurants
## what is the first restaurant that matches?
firstrest = category_rest[0]
## what are all of the categories associated with this restaurant?
some_categories = firstrest.categories
for cat in some_categories:
    print(cat.name)

As you can see, you can do some complex and recursive things with these objects! If you have a complicated schema, the overhead of setting up the ORM in SQLalchemy is, in my opinion, really worth it.