<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Eli Sander</title>
    <description>Personal website</description>
    <link>/</link>
    <atom:link href="/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Thu, 28 Oct 2021 18:23:56 +0000</pubDate>
    <lastBuildDate>Thu, 28 Oct 2021 18:23:56 +0000</lastBuildDate>
    <generator>Jekyll v3.9.0</generator>
    
      <item>
        <title>PyData Global 2021 slides</title>
        <description>&lt;p&gt;I gave a talk today at PyData Global about how Git works under the
hood. If you want to think through how to build your own version
control system, and learn more about how Git stores things and why,
check out my slides here!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.elisander.com/pdfs/PyData_2021.pdf&quot;&gt;Think Like Git&lt;/a&gt;&lt;/p&gt;
</description>
        <pubDate>Thu, 28 Oct 2021 00:00:00 +0000</pubDate>
        <link>/programming/2021/10/28/PyData-slides.html</link>
        <guid isPermaLink="true">/programming/2021/10/28/PyData-slides.html</guid>
        
        <category>Data Science</category>
        
        
        <category>programming</category>
        
      </item>
    
      <item>
        <title>ODSC West 2020 slides</title>
        <description>&lt;p&gt;Earlier today, I gave a talk on Business Skills for Data Science at
ODSC West (virtual). There are lots of talks about technical skills
for data scientists (I’ve given them myself), but I think business
skills are equally important in doing impactful work. Check out my
slides here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.elisander.com/pdfs/ODSC_West_2020.pdf&quot;&gt;Business Skills for Data Scientists&lt;/a&gt;&lt;/p&gt;
</description>
        <pubDate>Wed, 28 Oct 2020 00:00:00 +0000</pubDate>
        <link>/programming/2020/10/28/ODSC-slides.html</link>
        <guid isPermaLink="true">/programming/2020/10/28/ODSC-slides.html</guid>
        
        <category>Data Science</category>
        
        
        <category>programming</category>
        
      </item>
    
      <item>
        <title>PyCon 2019 slides</title>
        <description>&lt;p&gt;I went to my third PyCon last weekend, and gave a talk about pre-mortems and post-mortems, and how teams can use them to learn from failure. As promised in that talk, I’m making my slides available for anyone who wants to reference them. Those slides are here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.elisander.com/pdfs/PyCon_2019.pdf&quot;&gt;Lowering the Stakes of Failure with Pre-mortems and Post-mortems&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are a couple of bonus slides at the end, which didn’t make it into my presentation. Two years out of grad school, I can’t quite keep myself from making appendix slides.&lt;/p&gt;

&lt;p&gt;You can watch the talk itself on YouTube:&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/bmMBA6SDirU&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;You might have noticed that I haven’t posted anything in… quite a while. It was much easier to find time to write technical blog posts on a grad student schedule! I have done some writing for the Civis blog, though:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://medium.com/civis-analytics/hacktoberfest-at-civis-3b9c7d680b65&quot;&gt;Hacktoberfest At Civis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://medium.com/civis-analytics/civis-r-d-bookshelf-learning-a-new-language-b0d70634fc56&quot;&gt;Civis R&amp;amp;D Bookshelf: Learning A New Language&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://medium.com/civis-analytics/hacktoberfest-2017-recap-joblib-conda-forge-python-lambda-and-more-e969d60a49e8&quot;&gt;Hacktoberfest 2017 Recap: joblib, conda-forge, python-lambda, and more
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://medium.com/civis-analytics/pycon-2018-a-roundup-of-our-favorite-talks-7a9ab3628f9d&quot;&gt;PyCon 2018: A Roundup of Our Favorite Talks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://medium.com/civis-analytics/civis-bookshelf-open-source-python-integers-and-applied-predictive-modeling-d193c04903cb&quot;&gt;Civis R&amp;amp;D Bookshelf: Open source, Python integers, and applied predictive modeling
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://medium.com/civis-analytics/civis-r-d-bookshelf-project-management-edition-348f8da5250e&quot;&gt;Civis R&amp;amp;D Bookshelf: Project Management Edition&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also gave a talk about scikit-learn at PyCon 2018. I never posted the slides or YouTube link on my site, but better late than never!&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.elisander.com/pdfs/PyCon_2018.pdf&quot;&gt;Software Library APIs: Lessons Learned from scikit-learn&lt;/a&gt;&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/WCEXYvv-T5Q&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
</description>
        <pubDate>Tue, 07 May 2019 00:00:00 +0000</pubDate>
        <link>/programming/2019/05/07/PyCon-slides.html</link>
        <guid isPermaLink="true">/programming/2019/05/07/PyCon-slides.html</guid>
        
        <category>Python</category>
        
        
        <category>programming</category>
        
      </item>
    
      <item>
        <title>Basic Tools for Tuning Heuristic Optimizers</title>
        <description>&lt;p&gt;&lt;em&gt;Note: This post is a follow-up to another post I wrote, which is a
 more general introduction to heuristic optimization algorithms. I
 recommend you read
 &lt;a href=&quot;http://www.elisander.com/programming/2015/08/04/Heuristic-Search-Algorithms.html&quot;&gt;my earlier post&lt;/a&gt;
 before this one.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I wrote this post with helpful input from the awesome ladies
  at a Chicago Write/Speak/Code meetup! Kara Carrell wrote a great
  post on The Four A’s, and I’ll link to what others wrote as it
  comes online.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s say you have an optimization problem you want to solve. You can
calculate how good a given solution is (you have a fitness function),
but you don’t know anything about the fitness landscape. No problem;
you can use a heuristic optimization algorithm to find a
solution. Maybe you’ve already chosen an algorithm to work with (say,
simulated annealing). Now you’re good to go, right? Time to run it and
get the solution you need?&lt;/p&gt;

&lt;p&gt;Well, not exactly. Turns out, if you actually want to use heuristic
optimizers, you have a lot of decisions to make, and a lot of things to
tune. Here are the ones you’ll run into over and over:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Stop conditions (number of steps/run time)&lt;/li&gt;
  &lt;li&gt;Fixed parameters or parameter sets (mutation rate for genetic
algorithms, starting/ending temperatures for simulated annealing, etc.)&lt;/li&gt;
  &lt;li&gt;Functions (mutation, recombination, selection)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To ground this with an example, here’s what you need to tune for a
fairly minimal simulated annealing algorithm:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Number of steps&lt;/li&gt;
  &lt;li&gt;Starting temperature&lt;/li&gt;
  &lt;li&gt;Ending temperature&lt;/li&gt;
  &lt;li&gt;Cooling schedule (the function used to change the temperature as the
algorithm runs)&lt;/li&gt;
  &lt;li&gt;Mutation algorithm&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More sophisticated algorithms can easily require more tuning. And even
for these five elements, it’s not necessarily obvious how to
go about choosing these things.&lt;/p&gt;
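
&lt;p&gt;As a rough sketch of where those five choices show up in code, here is a minimal simulated annealing loop in Python with each one passed in as an argument. Everything named here (the toy fitness function, the linear cooling schedule, the Gaussian mutation, and the argument values) is an assumption made for illustration, not a recommendation.&lt;/p&gt;

```python
import math
import random


def linear_cooling(start_temp, end_temp, fraction_done):
    """Cooling schedule: move linearly from start_temp to end_temp."""
    return start_temp + (end_temp - start_temp) * fraction_done


def gaussian_mutation(x, rng):
    """Mutation: nudge the current solution by a small random amount."""
    return x + rng.gauss(0.0, 0.5)


def toy_fitness(x):
    """Hypothetical fitness function (higher is better), best at x = 3."""
    return -(x - 3.0) ** 2


def anneal(fitness, n_steps, start_temp, end_temp, cooling, mutate, seed=0):
    """Run simulated annealing and return the best solution seen."""
    rng = random.Random(seed)
    current = rng.uniform(-10.0, 10.0)
    best = current
    for step in range(n_steps):
        temp = cooling(start_temp, end_temp, step / n_steps)
        candidate = mutate(current, rng)
        delta = fitness(candidate) - fitness(current)
        # Improvements are always accepted (probability 1); worse moves are
        # accepted with probability exp(delta / temp).
        p_accept = math.exp(min(0.0, delta / temp))
        accept = rng.choices([True, False], weights=[p_accept, 1.0 - p_accept])[0]
        if accept:
            current = candidate
        best = max(best, current, key=fitness)
    return best


best_x = anneal(toy_fitness, n_steps=5000, start_temp=1.0, end_temp=0.01,
                cooling=linear_cooling, mutate=gaussian_mutation)
print(best_x, toy_fitness(best_x))
```

&lt;p&gt;Because the cooling schedule and the mutation are ordinary function arguments, you can swap in alternatives while tuning without touching the loop itself.&lt;/p&gt;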

&lt;p&gt;To make matters worse, it turns out that tuning is &lt;em&gt;really important&lt;/em&gt; to finding good
solutions. A good mutation function can be the difference between
finding the global optimum and getting stuck in a completely inferior
region of the solution space. Other tuning choices can also have
a huge impact on the solutions you find. But unless someone has
happened to study your exact problem before&lt;sup&gt;&lt;a href=&quot;#footnote1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;,
you probably don’t have any &lt;em&gt;a priori&lt;/em&gt; way to make these decisions.&lt;/p&gt;

&lt;p&gt;But there is hope! As it turns out, you can solve these problems using
a few basic strategies: grid search, random sweeps, and convergence plots.&lt;/p&gt;

&lt;h1 id=&quot;grid-search&quot;&gt;Grid Search&lt;/h1&gt;
&lt;p&gt;Let’s use the simulated annealing example, and try to tune just two
things: the starting and ending temperatures. To run a grid search,
simply choose a few possible values for each parameter, then try each
combination. Even if you have a continuous range of possible parameter
values, you can sample a wide range to get a sense of what is most
promising. If you want to tune your optimizer further, you can run
a grid search, find the most successful parameter combination, sample
more points around the parameter set you’ve chosen, and repeat.&lt;/p&gt;
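
&lt;p&gt;Here is a minimal sketch of that grid search in Python. The function run_optimizer is a hypothetical stand-in for a real simulated annealing run (its score is made up so the example runs on its own), and the candidate temperatures are arbitrary.&lt;/p&gt;

```python
import itertools
import random


def run_optimizer(start_temp, end_temp, seed):
    """Hypothetical stand-in for one run of your optimizer.

    Returns the best fitness found (higher is better). Here it is a made-up
    score that peaks at start_temp = 10 and end_temp = 0.1, plus some noise.
    """
    rng = random.Random(seed)
    score = -abs(start_temp - 10.0) - 10.0 * abs(end_temp - 0.1)
    return score + rng.gauss(0.0, 0.01)


start_temps = [1.0, 10.0, 100.0]
end_temps = [0.001, 0.1, 1.0]

results = {}
for start_temp, end_temp in itertools.product(start_temps, end_temps):
    # Average over a few seeds, since a single run can be lucky or unlucky.
    scores = [run_optimizer(start_temp, end_temp, seed) for seed in range(3)]
    results[(start_temp, end_temp)] = sum(scores) / len(scores)

# The winning combination is the grid point with the highest average score.
best_params = max(results, key=results.get)
print(best_params, results[best_params])
```

&lt;p&gt;To refine further, build a new, narrower grid around best_params and run the loop again.&lt;/p&gt;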

&lt;p&gt;Grid search works well for parameters, but you can also use it for
functions. Choose a few candidate mutation functions, and add those to
the grid. In particular, it’s good to try mutation functions that work
in very different ways; I have often been surprised at which mutation
function works best for a given problem.&lt;/p&gt;

&lt;p&gt;The main pitfall of this strategy is time. Grid search is
combinatorially explosive! Each parameter you’re tuning adds an
additional dimension to the grid, so the number of combinations you
need to test can blow up quickly. If you’re tuning many parameters at
once, make sure not to look at too many possible values per
parameter. If I’m tuning a lot of things, I try to stick to 3 options
per parameter.&lt;/p&gt;

&lt;h1 id=&quot;random-sweeps&quot;&gt;Random Sweeps&lt;/h1&gt;
&lt;p&gt;Grid search is the strategy I’ve used in the past, but after reading
Ahmed El Deeb’s &lt;a href=&quot;https://medium.com/rants-on-machine-learning/smarter-parameter-sweeps-or-why-grid-search-is-plain-stupid-c17d97a0e881#.wvt8k0fee&quot;&gt;post on parameter sweeps&lt;/a&gt;,
I’ve come to think that random sweeps are a better way. I highly recommend
you read Ahmed’s clear and concise post on the subject, but in
essence, random search lets you explore more values for each
parameter, which is an especially big improvement over grid search if
you have parameters that differ in importance (which is
often true, even if you don’t know which are important ahead of time).&lt;/p&gt;

&lt;p&gt;The main reason I’m still mentioning grid search is that it’s your
only option for tuning functions. In general, I’d recommend a mixed
approach: use a grid search for functions (including cooling
schedules) in combination with a randomized approach for individual
parameters. Note that parameters and functions are not necessarily
independent of each other, in terms of how they affect how well the
algorithm runs. If you have the computational resources, it’s worth it
to tune all parameters/functions at once, rather than varying one
thing at a time.&lt;/p&gt;
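
&lt;p&gt;A minimal sketch of that mixed approach in Python: an outer loop over candidate functions (the grid part) and an inner loop that samples numeric parameters at random. The two cooling schedules, the run_optimizer stand-in with its made-up score, and the log-uniform sampling ranges are all my assumptions for illustration.&lt;/p&gt;

```python
import random


def linear_cooling(start_temp, end_temp, fraction_done):
    """Move linearly from start_temp to end_temp."""
    return start_temp + (end_temp - start_temp) * fraction_done


def geometric_cooling(start_temp, end_temp, fraction_done):
    """Move from start_temp to end_temp by a constant ratio per step."""
    return start_temp * (end_temp / start_temp) ** fraction_done


def run_optimizer(cooling, start_temp, end_temp):
    """Hypothetical stand-in for a full optimizer run; returns a fitness."""
    midpoint_temp = cooling(start_temp, end_temp, 0.5)
    return -abs(midpoint_temp - 1.0)


rng = random.Random(42)
n_samples = 20
trials = []
for cooling in (linear_cooling, geometric_cooling):  # grid over functions
    for _ in range(n_samples):  # random sweep over numeric parameters
        # Sample log-uniformly, since temperatures span orders of magnitude.
        start_temp = 10 ** rng.uniform(0, 2)   # between 1 and 100
        end_temp = 10 ** rng.uniform(-3, -1)   # between 0.001 and 0.1
        fitness = run_optimizer(cooling, start_temp, end_temp)
        trials.append((fitness, cooling.__name__, start_temp, end_temp))

# Tuples sort by their first element, so this picks the highest fitness.
best = max(trials)
print(best)
```

&lt;p&gt;Every trial varies the function and both temperatures together, so interactions between them show up in the results instead of being hidden by one-at-a-time tuning.&lt;/p&gt;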

&lt;h1 id=&quot;convergence-plots&quot;&gt;Convergence Plots&lt;/h1&gt;
&lt;p&gt;Grid search and random sweeps are a great way to figure out what
combinations to try, but how do you know which combinations work best?
This is where convergence plots come in. During every tuning run of
your algorithm, save the fitness of the best solution every so often
(every 1000 or 10000 steps is usually good, depending on how fast the
algorithm runs). If you plot number of steps against solution fitness,
you should get a plot that looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/MC3-long.png&quot; alt=&quot;MC3 Convergence Plot&quot; width=&quot;600&quot; height=&quot;450&quot; border=&quot;10&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You’ll notice that the solution usually improves quickly at first,
then more slowly as it converges. The shape of the curve can vary a
lot based on the algorithm and parameter set. Algorithms that focus on
&lt;em&gt;exploitation&lt;/em&gt; tend to converge quickly, while those that focus on
&lt;em&gt;exploration&lt;/em&gt; tend to converge more slowly, often converging on better
solutions overall. Parameter and function choices can also affect the
shape of these curves. You can use these plots to answer many different
questions:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Has the algorithm converged?&lt;/strong&gt; If the curve has not yet flattened out,
you probably need to relax your stop condition (i.e., run the
algorithm for more steps). In my experience, this is the best way
to tune your stop condition. In the example plot above, you can see
that the solution is still improving slowly, so it’s probably
worth running a while longer.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;How variable is the solution quality?&lt;/strong&gt; That is, if you run the
algorithm with the same parameter set but different random seeds,
how different are the curves? Depending on your use case,
consistency may be more important than finding the “best” solution.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;How do different parameter sets (or different algorithms)
compare?&lt;/strong&gt; It’s possible to tune parameters based only on the final
solution quality, but convergence curves provide additional
information about &lt;em&gt;how&lt;/em&gt; a parameterized algorithm is searching the
space. Does a certain parameter combination help the algorithm
explore more effectively? Does a different parameter combination help
it converge quickly?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Is there anything wrong?&lt;/strong&gt; If an algorithm is converging almost
immediately, there’s probably something wrong with the
implementation, or the mutation function you’ve chosen. Here,
convergence plots can act as a simple diagnostic tool.&lt;/li&gt;
&lt;/ol&gt;
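
&lt;p&gt;Recording the data for a convergence plot only takes a few lines. This Python sketch uses a toy hill climber as a hypothetical stand-in for your optimizer, saves the best fitness every 1000 steps, and repeats the run with several random seeds. The fitness function, step size, and run length are assumptions for illustration.&lt;/p&gt;

```python
import random


def toy_fitness(x):
    """Hypothetical fitness function (higher is better), best at x = 3."""
    return -(x - 3.0) ** 2


def hill_climb_with_history(n_steps, record_every, seed):
    """Simple hill climber that records (step, best fitness) as it runs."""
    rng = random.Random(seed)
    best = rng.uniform(-10.0, 10.0)
    history = []
    for step in range(1, n_steps + 1):
        candidate = best + rng.gauss(0.0, 0.5)
        # Keep whichever of the two solutions has the higher fitness.
        best = max(best, candidate, key=toy_fitness)
        if step % record_every == 0:
            history.append((step, toy_fitness(best)))
    return history


# One curve per random seed, so run-to-run variability is visible too.
curves = {seed: hill_climb_with_history(10000, 1000, seed) for seed in range(3)}
for seed, history in curves.items():
    print(seed, history)
```

&lt;p&gt;Each history is a list of (step, fitness) pairs, so you can hand the two columns to whatever plotting library you prefer and overlay one line per seed or per parameter set.&lt;/p&gt;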

&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Heuristic optimization algorithms can be frustrating,
especially when you’re trying to tune them. These algorithms are very
flexible, which is a blessing and a curse. You can use them to solve
all kinds of problems, but you pay the cost of having to tweak them
quite a bit if you want something that works really well for your
problem. This is the fundamental trade-off you make: few assumptions,
but few guarantees. Lots of ways to customize and tune,
but… lots of ways to customize and tune.&lt;/p&gt;

&lt;p&gt;Although it can be time-consuming, if you’re going to use an algorithm
a lot, or if you really need the best solution possible, it’s worth
putting some time into tuning. Fortunately, grid search, random
sweeps, and convergence plots can get the job done. There are a few
specific considerations to keep in mind when tuning different
parameters and functions. Stay tuned (ha!) for a future post on this
topic!&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;a name=&quot;footnote1&quot;&gt;1&lt;/a&gt; Algorithm tuning is a thing that people have
studied. For very specific problems. Have they studied it for your
problem? Maybe. It’s worth a look! If you can identify that your
problem is fundamentally a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Knapsack_problem&quot;&gt;knapsack problem&lt;/a&gt;, or
a 
&lt;a href=&quot;https://en.wikipedia.org/wiki/Travelling_salesman_problem&quot;&gt;travelling salesman problem&lt;/a&gt;,
or another classic well-defined
problem, then a Google Scholar search can let you see what others have
done, and at least get some good ideas for parameter choices and
mutation functions. But even in this basic scenario, you will at least
want to tune your stop conditions, since the &lt;em&gt;size&lt;/em&gt; of your problem is
likely to be different from the one studied.&lt;/p&gt;
</description>
        <pubDate>Tue, 18 Oct 2016 00:00:00 +0000</pubDate>
        <link>/programming/2016/10/18/Basic-Tools-for-Tuning-Heuristic-Optimizers.html</link>
        <guid isPermaLink="true">/programming/2016/10/18/Basic-Tools-for-Tuning-Heuristic-Optimizers.html</guid>
        
        <category>Algorithms</category>
        
        
        <category>programming</category>
        
      </item>
    
      <item>
        <title>My Talk at PyData Chicago 2016</title>
        <description>&lt;p&gt;Last weekend I gave a talk at PyData Chicago! It was called
“Evolutionary Algorithms: Perfecting the Art of ‘Good Enough’” (props
to my thesis advisor, Stefano Allesina, for the catchy title). It was
heavily based on my &lt;a href=&quot;http://www.elisander.com/programming/2015/08/04/Heuristic-Search-Algorithms.html&quot;&gt;blog post&lt;/a&gt; and workshops I’ve given. Heuristic
optimizers are a fun topic for me because they are so general and
useful, but they’re not really a hot topic, and I think a lot of
people have just never encountered them. They’re a great
addition to a data scientist’s toolkit.&lt;/p&gt;

&lt;p&gt;It was my first tech talk, and a lot of fun. It was very different
from giving an academic talk. It was nice to give a talk that was
about getting people excited about a technique, rather than to present
results of a specific study. I’m very proud of the fact that I got to
talk about managing Skyrim loot as an example of an optimization problem
(it’s quite literally a knapsack problem!).&lt;/p&gt;

&lt;p&gt;You can take a look at the slides here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.elisander.com/pdfs/PyDataTalk.pdf&quot;&gt;Evolutionary Algorithms: Perfecting the Art of “Good Enough”&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the talk itself here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.youtube.com/watch?feature=player_embedded&amp;amp;v=iJ4MiibHt68&quot; target=&quot;_blank&quot;&gt;&lt;img src=&quot;http://img.youtube.com/vi/iJ4MiibHt68/0.jpg&quot; alt=&quot;Evolutionary Algorithms: Perfecting the Art of 'Good Enough'&quot; width=&quot;240&quot; height=&quot;180&quot; border=&quot;10&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
</description>
        <pubDate>Mon, 29 Aug 2016 00:00:00 +0000</pubDate>
        <link>/programming/2016/08/29/My-Talk-At-PyData.html</link>
        <guid isPermaLink="true">/programming/2016/08/29/My-Talk-At-PyData.html</guid>
        
        <category>Python</category>
        
        <category>Algorithms</category>
        
        
        <category>programming</category>
        
      </item>
    
      <item>
        <title>Collaborating with your future self using Markdown documents</title>
        <description>&lt;p&gt;The pipeline for an analysis project can get complicated and
confusing, especially if you’re simulating your own data. I often
create pipelines with several different scripts in different
languages, and it’s easy to forget a step. But a couple of months ago,
I wrote myself a little Markdown file that looks something like this*:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;0. Format data and subsample to build training set
**./RCode/BuildTrainingSet.R**
(for dataset B, use **./RCode/BuildTrainingSetB.R**)

1. Submit several runs of genetic algorithm to computing cluster
**./PythonCode/LaunchGA.py**

2. Check convergence
**./PythonCode/CheckConvergence.py**
(if unconverged, return to 1)

3. Calculate mutual information for best solution
**./RCode/MutualInformation.R**

4. Visualize using alluvial diagrams
**TBD**
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is simply a numbered list of what I need to do to run my entire
analysis from start to finish, complete with paths to the relevant
code, and small notes to myself. I used to trust myself to remember
this pipeline. I mean, I came up with the pipeline, and I commented
the code, and I gave my files informative names. But when I analyze
data, I end up writing lots of small scripts, to explore the structure
of the data and to test assumptions. By the time I need to rerun the
analysis, I’ve lost track of the files I need to run, and the result
involves opening and reading a bunch of files, mixed with some trial
and error running them.&lt;/p&gt;

&lt;p&gt;But why rerun the analysis at all? If you’ve ever published a
scientific article, you probably know why. Reviewers always have changes
to make and model variants to test. This often involves changing a
few lines, then running the entire analysis over again. But even
outside of academia, if you work with data, this is a problem that
you’ll run into. You’ll learn something new about your dataset, or
want to try a different prior or a different algorithm. Writing things
down in a structured way has saved me hours during that process.&lt;/p&gt;

&lt;p&gt;There’s another good reason to write up your process like this. Writing
things down step by step forces you to think through the work that
you’ve done. Why &lt;em&gt;did&lt;/em&gt; you write that script? Does this pipeline make
sense? Are there steps that you did by hand that you need to make
note of, or automate? Making the process clear in your head will make
it clearer when you present your work to someone else. It also
provides a convenient outline for the methods section if you’re
writing a scientific article.&lt;/p&gt;

&lt;p&gt;Why a Markdown file in particular? Well, you could use HTML, or a
simple text file, but I like Markdown for a couple of reasons. First,
it is very simple. If you don’t know how to use it, you can
&lt;a href=&quot;https://daringfireball.net/projects/markdown/syntax&quot;&gt;read an overview&lt;/a&gt;
and learn in about five minutes. It also plays
well with code hosting sites like GitHub and Bitbucket. If you open
the file on GitHub, it will render nicely. But it is also easy to read
in a text editor, because the markup is less intrusive than LaTeX or
HTML.&lt;/p&gt;

&lt;p&gt;I only recently started doing these little Markdown write-ups, because
I somehow believed that I would always remember what my code
did. After all, I wrote it. But after a few months have passed, anyone
could have written that code. In essence, I’m collaborating with past
me, except that past me never responds to emails or answers my
perfectly reasonable questions.&lt;/p&gt;

&lt;p&gt;Now, rather than dealing with the crappy collaborator that is past
me, I like to think about how future me will think of me as a
collaborator. Future me is also kind of a pain to collaborate with. She expects
me to know in advance what questions she has, and she forgets so much
of what I worked on. But it’s not hard to collaborate well with your
future self. You already have a record of your analysis: the
code in your repository. The Markdown notes simply act as a table of
contents. Write one for each of your data pipelines, and your future
self will thank you.&lt;/p&gt;

&lt;p&gt;* This example is a simplified version of a project I’m working on right now. The details aren’t important, but including this level of detail in your own write-ups will make them more useful to you.&lt;/p&gt;
</description>
        <pubDate>Fri, 03 Jun 2016 00:00:00 +0000</pubDate>
        <link>/programming/2016/06/03/Collaborating-with-your-future-self-using-Markdown-docs.html</link>
        <guid isPermaLink="true">/programming/2016/06/03/Collaborating-with-your-future-self-using-Markdown-docs.html</guid>
        
        <category>Markdown</category>
        
        <category>Academia</category>
        
        <category>Data Science</category>
        
        
        <category>programming</category>
        
      </item>
    
      <item>
        <title>Capturing Shell Output in R and Python</title>
        <description>&lt;p&gt;Sometimes I spend significant time in R or Python trying to do
something which is trivial in bash. Calling the shell instead is especially useful when I’m
working with very large files that will take a long time to read
in. Why read in an entire file to get the last line, when I could just
use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tail -n 1&lt;/code&gt;? Or if I want the line count, why read it in when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wc
-l&lt;/code&gt; will get the job done faster?&lt;/p&gt;

&lt;p&gt;It turns out that it’s not too complicated to capture shell output in
R or Python. Here’s how I do it.&lt;/p&gt;

&lt;h2 id=&quot;python&quot;&gt;Python&lt;/h2&gt;

&lt;p&gt;If you use Python 3, capturing shell output is pretty simple (if
you’re still on Python 2, the tides are turning! It’s time to make the
change!). You can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;subprocess&lt;/code&gt; module to get the output in
bytes, then decode and parse it.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import subprocess

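## Get the last line of the file whose path is stored in 'fname'
## (passing the command as a list avoids shell quoting problems with odd file names)
last_line = subprocess.check_output([&quot;tail&quot;, &quot;-n&quot;, &quot;1&quot;, fname])
## convert from bytes to string and strip surrounding whitespace
## 'UTF-8' is a common encoding, but you may need to use something else
last_line = last_line.decode('UTF-8').strip()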
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
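
&lt;p&gt;The same pattern covers the line count example from the start of this post. This is a minimal sketch that assumes a Unix-like system where wc is available; the temporary file and its contents are only there so the example runs on its own.&lt;/p&gt;

```python
import os
import subprocess
import tempfile

# Write a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("first\nsecond\nthird\n")
    fname = handle.name

# 'wc -l' prints the line count followed by the file name,
# so the count is the first whitespace-separated field.
output = subprocess.check_output(["wc", "-l", fname])
line_count = int(output.decode("UTF-8").split()[0])
print(line_count)

# Clean up the throwaway file.
os.remove(fname)
```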

&lt;h2 id=&quot;r&quot;&gt;R&lt;/h2&gt;

&lt;p&gt;R makes this process easy too. You may have used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;system()&lt;/code&gt; before to
submit shell commands. It turns out that if you set the argument
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;intern = TRUE&lt;/code&gt;, you’ll get the output as a character vector, so you
don’t even have to deal with encoding! The output may take some
parsing, but the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stringr&lt;/code&gt; package is good for that.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;require(stringr)
## Get the last line of the file 'fname'
lastLine = system(stringr::str_c(&quot;tail -n 1 &quot;, fname), intern = TRUE)
## strip leading/trailing whitespace
lastLine = stringr::str_trim(lastLine)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This has saved me from reinventing the wheel many times since I
learned it. Hopefully it helps you too!&lt;/p&gt;
</description>
        <pubDate>Thu, 31 Mar 2016 00:00:00 +0000</pubDate>
        <link>/programming/2016/03/31/Capturing-Shell-Output-in-R-and-Python.html</link>
        <guid isPermaLink="true">/programming/2016/03/31/Capturing-Shell-Output-in-R-and-Python.html</guid>
        
        <category>R</category>
        
        <category>Python</category>
        
        
        <category>programming</category>
        
      </item>
    
      <item>
        <title>I Published a Thing in Code Words!</title>
        <description>&lt;p&gt;The &lt;a href=&quot;https://www.recurse.com/&quot;&gt;Recurse Center&lt;/a&gt; puts out a quarterly
publication called Code Words, which publishes articles that try to
capture the fun of digging into
a problem and learning about programming. I wrote a piece on the
grammar of graphics, and how it can provide a language for exploring
and talking about data visualization. It uses examples from R’s
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ggplot&lt;/code&gt; package, but the ideas are more general. Check out my
article, and other great articles from RC alums, in the 
&lt;a href=&quot;https://codewords.recurse.com/issues/six&quot;&gt;sixth issue of Code Words&lt;/a&gt;!&lt;/p&gt;
</description>
        <pubDate>Thu, 17 Mar 2016 00:00:00 +0000</pubDate>
        <link>/programming/2016/03/17/I-Published-a-Thing-in-Code-Words.html</link>
        <guid isPermaLink="true">/programming/2016/03/17/I-Published-a-Thing-in-Code-Words.html</guid>
        
        <category>R</category>
        
        <category>ggplot</category>
        
        
        <category>programming</category>
        
      </item>
    
      <item>
        <title>What '.' Means in R, and Why it Matters</title>
        <description>&lt;p&gt;As far as I can tell, the  R community has
no generally-accepted style
guide. &lt;a href=&quot;https://google.github.io/styleguide/Rguide.xml&quot;&gt;Google&lt;/a&gt; and
&lt;a href=&quot;http://adv-r.had.co.nz/Style.html&quot;&gt;Hadley Wickham&lt;/a&gt;
both have style guides, but across and even within CRAN packages,
different naming and spacing conventions abound. You’re likely to find
variables named in camelCase, snake_case, or, interestingly, dot.case. This
last convention is unusual, because unlike many languages, R does
not enforce specific syntactic meaning for dots. Dots can denote
methods for S3 classes, but they don’t have to. This means that R only
cares about dots &lt;em&gt;sometimes&lt;/em&gt;, with confusing results.&lt;/p&gt;

&lt;p&gt;S3 generic functions, like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;print&lt;/code&gt;, use the function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UseMethod&lt;/code&gt; to
dispatch the appropriate method for the data type. The methods are
named using the construction &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;function.class&lt;/code&gt; (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;print.lm&lt;/code&gt;,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;print.data.frame&lt;/code&gt;). Sometimes you can even call the methods
directly, without the generic. For example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; print('hi')
[1] &quot;hi&quot;
&amp;gt; print.data.frame('hi')
NULL
&amp;lt;0 rows&amp;gt; (or 0-length row.names)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The character object &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;hi&quot;&lt;/code&gt; isn’t a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data.frame&lt;/code&gt;, but R will try to
call the method like a normal function anyway. S3 is very unstructured in this
way. The problem is that R can’t always distinguish between
S3 generics/methods and dot case functions:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; example &amp;lt;- function(x) print(mean(x))
&amp;gt; example.two &amp;lt;- function(x) print(sum(x))
&amp;gt; methods(example)
[1] example.two
see '?methods' for accessing help and source code
Warning message:
In .S3methods(generic.function, class, parent.frame()) :
function 'example' appears not to be S3 generic; found functions that look like S3 methods 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here I’ve defined two unrelated functions, but R isn’t sure what to do
with them. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;example&lt;/code&gt; isn’t an S3 generic, because I don’t call
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UseMethod&lt;/code&gt; to dispatch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;example&lt;/code&gt; methods for different classes, but
because of S3 naming structure, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;example.two&lt;/code&gt; looks like an S3 method
on the class &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;two&lt;/code&gt;. S3 is so informal that there is no checking to
distinguish between the possibilities, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;methods()&lt;/code&gt; function
lists &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;example.two&lt;/code&gt; as a method (although at least it warns you that
it’s not sure).&lt;/p&gt;

&lt;p&gt;This behavior is both unexpected and problematic. If you’re relying on
a specific behavior from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;methods()&lt;/code&gt;, someone else’s package (or your
own functions!) could give you unreliable results. There is no reason to
use dot case except as convention, and it’s not a set convention, even
in base R. There are even base R functions that use a different case
for the function name and arguments (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;colSums (x, na.rm = FALSE, dims
= 1)&lt;/code&gt;)! I’d love to see an R overhaul, where a new R version is
released with a consistent style and an accompanying style guide. But until
that time, R users should at least stop using dot case.&lt;/p&gt;
</description>
        <pubDate>Thu, 10 Dec 2015 00:00:00 +0000</pubDate>
        <link>/programming/2015/12/10/What-'.'-Means-in-R-and-Why-it-Matters.html</link>
        <guid isPermaLink="true">/programming/2015/12/10/What-'.'-Means-in-R-and-Why-it-Matters.html</guid>
        
        <category>R</category>
        
        
        <category>programming</category>
        
      </item>
    
      <item>
        <title>Creating and populating a database using Python and SQLalchemy. Part 2&amp;#58; Classes and queries</title>
        <description>&lt;p&gt;Last month I wrote a post on
&lt;a href=&quot;http://elisander.com/2015/08/04/SQLalchemy-part-1.html&quot;&gt;the SQLalchemy engine and session&lt;/a&gt;. Now
I’m going to describe how you can set up a mapping for your schema so
that you can populate and query your database.&lt;/p&gt;

&lt;h1 id=&quot;setting-up-your-schema&quot;&gt;Setting up your Schema&lt;/h1&gt;

&lt;p&gt;Setting up your schema correctly is what will allow you to get the
most out of SQLalchemy. I’ll assume you’ve already set up your
session and that it’s stored in the object &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;session&lt;/code&gt;. After this, you’ll
want to import some functions:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, ForeignKey, Integer, String, Float, Boolean
from sqlalchemy import Index
from sqlalchemy.orm import relationship, backref

Base = declarative_base()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;All I’ve done here besides importing is to set up &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Base&lt;/code&gt;. This is the
declarative base class that SQLalchemy generates for you. When you set up your own
table classes, they will inherit from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Base&lt;/code&gt;, which maps their columns
and registers each table in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Base.metadata&lt;/code&gt; automatically.&lt;/p&gt;

&lt;p&gt;Continuing with my example from the previous post, I’ll be using code
snippets that I used to convert the &lt;a href=&quot;https://www.yelp.com/academic_dataset&quot;&gt;Yelp
academic dataset&lt;/a&gt; from JSON
files to a PostgreSQL relational database. Here is what a basic table
class looks like:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;class Restaurant(Base):
	__tablename__ = 'restaurant'
	restaurant_id = Column(String(250), index = True, primary_key = True)
	ages_allowed = Column(String(250))
	price_range = Column(Integer)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This code is fairly readable; I’ve told SQLalchemy to call this table
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restaurant&lt;/code&gt;, and I’ve given it the names and types of a few
columns. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restaurant_id&lt;/code&gt; column is the primary key for this table,
and I’ve also created an index for the column, to make queries more efficient.&lt;/p&gt;

&lt;h1 id=&quot;basic-table-relationships&quot;&gt;Basic Table Relationships&lt;/h1&gt;

&lt;p&gt;If your schema includes multiple tables, you will probably want to
establish relationships between them. For example, I also created a
table of restaurant reviews, called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;review&lt;/code&gt;, using a class much like
the one above, called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Review&lt;/code&gt;. From the restaurant’s side this is a
one-to-many relationship, since a restaurant may have multiple
reviews, but a review will only be associated with a single
restaurant. If I want to establish this relationship through the ORM,
my &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Restaurant&lt;/code&gt; class will have an additional line:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;class Restaurant(Base):
	__tablename__ = 'restaurant'
	restaurant_id = Column(String(250), index = True, primary_key = True)
	ages_allowed = Column(String(250))
	price_range = Column(Integer)
	reviews = relationship('Review', backref = 'restaurant')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This last line creates a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.reviews&lt;/code&gt; attribute for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Restaurant&lt;/code&gt;. The
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backref&lt;/code&gt; argument also creates a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.restaurant&lt;/code&gt; attribute for class
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Review&lt;/code&gt;. This is syntactic sugar that allows me to set up the whole
relationship in this line, without specifying the relationship
in the class &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Review&lt;/code&gt;.&lt;/p&gt;
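
&lt;p&gt;For the relationship to resolve, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Review&lt;/code&gt; needs a foreign key pointing back at
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restaurant&lt;/code&gt;. Here is a minimal sketch of how the two classes might fit together. The
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Review&lt;/code&gt; columns are hypothetical (they are not the ones from my actual schema), and I’m
assuming a throwaway in-memory SQLite database in place of PostgreSQL so the snippet runs on its own:&lt;/p&gt;

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import relationship, sessionmaker

try:
    # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:
    # older releases, as used elsewhere in this post
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Restaurant(Base):
    __tablename__ = 'restaurant'
    restaurant_id = Column(String(250), primary_key=True)
    price_range = Column(Integer)
    reviews = relationship('Review', backref='restaurant')


class Review(Base):
    # hypothetical columns, for illustration only
    __tablename__ = 'review'
    review_id = Column(Integer, primary_key=True)
    stars = Column(Integer)
    # the foreign key tells SQLAlchemy how review rows join to restaurant rows
    restaurant_id = Column(String(250), ForeignKey('restaurant.restaurant_id'))


# assumption: in-memory SQLite instead of the PostgreSQL database from the post
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

restaurant = Restaurant(restaurant_id='ABC123', price_range=1)
# appending to .reviews also fills in the review's restaurant_id on commit
restaurant.reviews.append(Review(stars=4))
session.add(restaurant)
session.commit()

# the backref lets us walk from a review back to its restaurant
review = session.query(Review).first()
print(review.restaurant.restaurant_id)
print(len(restaurant.reviews))
```

&lt;p&gt;Without the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ForeignKey&lt;/code&gt; column, SQLalchemy has no way to work out how the two tables join,
and it will raise an error when it configures the mappers.&lt;/p&gt;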

&lt;p&gt;Many-to-many relationships are slightly more complicated. In my
database, a restaurant may be associated with many categories (Cafe,
Italian, Chinese Food, etc.), and each category will be associated
with many restaurants. This means that I need to set up a table for
categories, and a junction table that joins the restaurant and
category information. I also need to establish relationships between
the restaurant/category tables and the junction table, as follows:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;class Restaurant(Base):
	__tablename__ = 'restaurant'
	restaurant_id = Column(String(250), index = True, primary_key = True)
	ages_allowed = Column(String(250))
	price_range = Column(Integer)
	categories = relationship('Category', secondary = 'restaurant_category')

class Category(Base):
	__tablename__  = 'category'
	category_id = Column(Integer, primary_key = True)
	restaurants = relationship('Restaurant', secondary = 'restaurant_category')
	name = Column(String(250), nullable = False)

class Restaurant_Category(Base):
	__tablename__ = 'restaurant_category'
	category_id = Column(Integer, ForeignKey('category.category_id'),
	                     primary_key = True)
	restaurant_id = Column(String(250), ForeignKey('restaurant.restaurant_id'),
						   primary_key = True)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The gist of this is that I set up a separate junction table in
SQLalchemy, which specifies that its columns are foreign keys, and
that they form a composite primary key for the table (by setting
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;primary_key = True&lt;/code&gt; for both columns). I also set up a relationship
for both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Restaurant&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Category&lt;/code&gt;, telling SQLalchemy that this
relationship is specified by a secondary table, in this case
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restaurant_category&lt;/code&gt;.&lt;/p&gt;

&lt;h1 id=&quot;populating-and-querying-your-database&quot;&gt;Populating and Querying your Database&lt;/h1&gt;

&lt;p&gt;To create these tables in your database of choice, take your engine
object and run:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Base.metadata.create_all(engine)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
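
&lt;p&gt;If you want to confirm which tables were actually created, one option is to ask the engine
directly. This is a rough sketch, again assuming an in-memory SQLite database and a cut-down
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Category&lt;/code&gt; class so that it stands alone:&lt;/p&gt;

```python
from sqlalchemy import Column, Integer, String, create_engine, inspect

try:
    # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:
    # older releases
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Category(Base):
    __tablename__ = 'category'
    category_id = Column(Integer, primary_key=True)
    name = Column(String(250), nullable=False)


# assumption: in-memory SQLite instead of PostgreSQL
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
# running it a second time is harmless: existing tables are skipped by default
Base.metadata.create_all(engine)

# list the tables that now exist in the database
table_names = inspect(engine).get_table_names()
print(table_names)
```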

&lt;p&gt;Populating the database is pretty easy, once you’ve set up the
schema. Adding a category is as simple as:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;category = Category(name=item)
session.add(category)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you want to add a restaurant and link it to the category, you
can append the category to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Restaurant&lt;/code&gt; object:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;restaurant = Restaurant(restaurant_id = 'ABC123', price_range = 1)
restaurant.categories.append(category)
session.add(restaurant)
session.commit()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Querying tables with the ORM is a powerful way to work with your
database, but it takes some getting used to.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;category = session.query(Category)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A basic query on a class like the one above is equivalent to the SQL
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SELECT * FROM category&lt;/code&gt;. Note that it returns a query object rather than rows,
and no SQL is sent to the database until you ask for results, so
the query can be refined over multiple lines.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;category = category.filter(Category.name == 'Cafe')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This line will find all category rows with the name ‘Cafe’. Note that
this can also be run as a single line:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;category = session.query(Category).filter(Category.name == 'Cafe')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There are many other ways to filter and adapt your queries, many of
which are listed in
&lt;a href=&quot;http://docs.sqlalchemy.org/en/rel_1_0/orm/tutorial.html#querying&quot;&gt;this tutorial&lt;/a&gt;. If
you set up logging, you can look at the actual SQL that is being run,
which can help you debug and improve your queries.&lt;/p&gt;
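
&lt;p&gt;As a sketch of what that might look like, the snippet below turns on SQLalchemy’s
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sqlalchemy.engine&lt;/code&gt; logger and also prints the SQL for a query without running it. The
in-memory SQLite engine and the pared-down &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Category&lt;/code&gt; class are assumptions to keep the
example self-contained:&lt;/p&gt;

```python
import logging

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import sessionmaker

try:
    # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:
    # older releases
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Category(Base):
    __tablename__ = 'category'
    category_id = Column(Integer, primary_key=True)
    name = Column(String(250), nullable=False)


# log every statement the engine sends to the database
logging.basicConfig()
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)

# assumption: in-memory SQLite instead of PostgreSQL
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

query = session.query(Category).filter(Category.name == 'Cafe')

# converting a query to a string shows its SQL without running it
sql = str(query)
print(sql)

# running the query sends the SELECT, which then shows up in the log
rows = query.all()
```

&lt;p&gt;Printing the query is handy for a quick check, while the logger shows everything,
including the statements issued by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;create_all&lt;/code&gt;.&lt;/p&gt;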

&lt;p&gt;The aspect of SQLalchemy I found most confusing was figuring out how to
access actual table information from the query object. There are a few
approaches to this. If there are multiple rows that match your query,
you can iterate over them or put them all in a list:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;for row in category:
    print(row.name)

## this is equivalent to the code above,
## but stores each Category object in a list
cats = category.all()
for cat in cats:
    print(cat.name)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you only want a single row from your query, you can use
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;category.first()&lt;/code&gt; to get the first match. It’s important to note that
the query object is giving you rows in the form of a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Category&lt;/code&gt;
object. We can use these objects to take advantage of all of the
relationships we set up before:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mycategory = category.first()
## let's get all of the restaurants associated with this category:
category_rest = mycategory.restaurants
## what is the first restaurant that matches?
firstrest = category_rest[0]
## what are all of the categories associated with this restaurant?
some_categories = firstrest.categories
for cat in some_categories:
    print(cat.name)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, you can do some complex and recursive things with
these objects! If you have a complicated schema, the overhead of
setting up the ORM in SQLalchemy is, in my opinion, really worth it.&lt;/p&gt;
</description>
        <pubDate>Tue, 08 Sep 2015 00:00:00 +0000</pubDate>
        <link>/programming/2015/09/08/SQLalchemy-part-2.html</link>
        <guid isPermaLink="true">/programming/2015/09/08/SQLalchemy-part-2.html</guid>
        
        <category>Python</category>
        
        <category>SQL</category>
        
        <category>SQLalchemy</category>
        
        
        <category>programming</category>
        
      </item>
    
  </channel>
</rss>
