Population in Politics, Simple Frequency Counts January 13th, 2016
Patrick Stein

The first coding assignment of the Data Management and Visualization class that I am doing on Coursera is just to do some frequency analysis on some of the variables that will be involved in the research you want to do.

I am using the 2012 U.S. Presidential Election data broken down by county.

Frequency Counts

The assignment was to do frequency counts. If I did tables of raw frequency counts, the tables would be huge. There are 4588 counties in the data set. There are 4075 different values for the total number of votes cast in a county. As such, I bucketed the counts based upon the power of ten of the value. Here is the output for the total number of votes cast:

CL-USER> (print-log-buckets "Total" #'vote-distribution-votes-cast)
+----------------+--------------------+
|    Votes Total | Number of counties |
+----------------+--------------------+
|            1's | 3                  |
|           10's | 72                 |
|          100's | 646                |
|        1,000's | 2102               |
|       10,000's | 1513               |
|      100,000's | 247                |
|    1,000,000's | 5                  |
+----------------+--------------------+
NIL

Here is the output for the total number of votes for Democratic candidates and for the Republican candidates:

CL-USER> (print-log-buckets "Dem" #'vote-distribution-dem)
+----------------+--------------------+
|      Votes Dem | Number of counties |
+----------------+--------------------+
|            1's | 15                 |
|           10's | 166                |
|          100's | 1171               |
|        1,000's | 2380               |
|       10,000's | 730                |
|      100,000's | 124                |
|    1,000,000's | 2                  |
+----------------+--------------------+
NIL
CL-USER> (print-log-buckets "GOP" #'vote-distribution-gop)
+----------------+--------------------+
|      Votes GOP | Number of counties |
+----------------+--------------------+
|            1's | 12                 |
|           10's | 171                |
|          100's | 937                |
|        1,000's | 2349               |
|       10,000's | 1009               |
|      100,000's | 110                |
+----------------+--------------------+
NIL

With the above frequency count, we can see that of the five counties with over a million votes cast, the Democrats got more than a million votes in two of them whilst the Republicans did not get a million votes in any county.

The numbers are pretty close the whole way through, but that still doesn’t mean a great deal. It could be that the fifteen counties where Democrats got fewer than ten votes were counties with ten thousand votes cast. So, I put together a small function then to get the worst counties for a given party:

(defun get-worst-counties (key &optional (how-many 10))
  (subseq (stable-sort (copy-seq *by-county*)
                       #'<
                       :key (lambda (dist)
                              (/ (funcall key dist)
                                 (max 1
                                      (vote-distribution-votes-cast dist)))))
          0
          how-many))

The worst counties for Democrats and Republicans?

CL-USER> (get-worst-counties #'vote-distribution-dem)
(#S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "HANCOCK CTY TOWNSHIPS" :DEM 0 :GOP 0 :VOTES-CAST 0)
 #S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "UPTON" :DEM 0 :GOP 0 :VOTES-CAST 0)
 #S(VOTE-DISTRIBUTION :STATE "TX" :COUNTY "KING" :DEM 5 :GOP 139 :VOTES-CAST 145)
 #S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "MORO PLT." :DEM 1 :GOP 21 :VOTES-CAST 23)
 #S(VOTE-DISTRIBUTION :STATE "MT" :COUNTY "WIBAUX" :DEM 25 :GOP 421 :VOTES-CAST 544)
 #S(VOTE-DISTRIBUTION :STATE "TX" :COUNTY "ROBERTS" :DEM 25 :GOP 408 :VOTES-CAST 439)
 #S(VOTE-DISTRIBUTION :STATE "ID" :COUNTY "MADISON" :DEM 832 :GOP 13445 :VOTES-CAST 14412)
 #S(VOTE-DISTRIBUTION :STATE "ID" :COUNTY "FRANKLIN" :DEM 325 :GOP 5195 :VOTES-CAST 5600)
 #S(VOTE-DISTRIBUTION :STATE "TX" :COUNTY "STERLING" :DEM 31 :GOP 459 :VOTES-CAST 494)
 #S(VOTE-DISTRIBUTION :STATE "TX" :COUNTY "GLASSCOCK" :DEM 44 :GOP 526 :VOTES-CAST 578))
CL-USER> (get-worst-counties #'vote-distribution-gop)
(#S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "HANCOCK CTY TOWNSHIPS" :DEM 0 :GOP 0 :VOTES-CAST 0)
 #S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "UPTON" :DEM 0 :GOP 0 :VOTES-CAST 0)
 #S(VOTE-DISTRIBUTION :STATE "SD" :COUNTY "SHANNON" :DEM 2922 :GOP 188 :VOTES-CAST 3130)
 #S(VOTE-DISTRIBUTION :STATE "CT" :COUNTY "HARTFORD" :DEM 31735 :GOP 2138 :VOTES-CAST 34037)
 #S(VOTE-DISTRIBUTION :STATE "DC" :COUNTY "DISTRICT OF COLUMBIA" :DEM 222332 :GOP 17337 :VOTES-CAST 243348)
 #S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "PENOBSCOT NATION VOT DST" :DEM 253 :GOP 23 :VOTES-CAST 281)
 #S(VOTE-DISTRIBUTION :STATE "NY" :COUNTY "BRONX" :DEM 288378 :GOP 26304 :VOTES-CAST 316047)
 #S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "ISLE AU HAUT" :DEM 48 :GOP 5 :VOTES-CAST 57)
 #S(VOTE-DISTRIBUTION :STATE "MA" :COUNTY "PROVINCETOWN" :DEM 2121 :GOP 210 :VOTES-CAST 2380)
 #S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "MONHEGAN PLT." :DEM 49 :GOP 5 :VOTES-CAST 55))

As you can see from this, there are two counties which show no votes cast. In both of those cases, there are no precincts reporting in the data set. The data set tells the number of precincts in the county along with the number of precincts reporting. These counties with none of the precincts reporting are significant glitches in the data. On the other hand, some counties in the data have hundreds of precincts where all but one reported. I could remove a county from the data if not all of its precincts reported. However, I believe that within a county, single precincts will not differ very much from other precincts which were counted in the data. Further, as I do not have any hope of determining the population density down to the precinct level, I am just going to roll with what I have.

Implementation

I put together some simple utilities around Fare-CSV to retrieve particular columns of a CSV file formatted in particular ways. Here is the source code for those utilities.

One of the things that immediately became apparent is that there are two separate columns in the database labelled "TOTAL VOTES CAST". I wanted to make sure there were no confusion, so I wrote a quick function then to check that both of those columns agree everywhere.

(defun both-total-votes-columns-agree-everywhere ()
  (let ((columns (find-columns-with-label "TOTAL VOTES CAST")))
    (flet ((votes-cast-agrees (*row*)
             (apply #'= (get-columns-as #'parse-integer-allowing-junk
                                        columns))))
      (every #'votes-cast-agrees (data-rows)))))

(assert (both-total-votes-columns-agree-everywhere))

Spoiler: They do. Whew!

The data here has one row per county. I might have preferred there be one row per county/candidate pair. Regardless, I wrote a short function that takes a party name and all of the columns identifying parties along with the columns identifying how many votes a given party received.

(defun count-votes (party parties votes)
  (loop :for p :in parties
     :for v :in votes
     :when (string= p party)
     :sum v))

For example, this might get arguments party = "DEM", parties = ("DEM" "GOP" "LIB" "GRN" "" "" "" "" "" "" "" "" "" "" "" ""), and votes = (91696 121234 5539 2127 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL). This function sums up all of the numbers in the votes list where the corresponding entry in the parties list matches the given party.

I made a little data structure to hold the data that I am interested in for each county.

(defstruct vote-distribution
  (state "" :type string)
  (county "" :type string)
  (dem 0 :type integer)
  (gop 0 :type integer)
  (votes-cast 0 :type integer))

I then created a function which returns a function. The returned function returns an instance of my data structure for the row passed into it. Note: the data set contains rows which roll-up the results for a whole state. For those rows, the FIPS code for the county is zero.

(defun make-votes-by-county-data-collector ()
  (let ((state-column  (find-column-with-label "State Postal"))
        (county-column (find-column-with-label "County Name"))
        (fips-column   (find-column-with-label "FIPS Code"))
        (total-column  (find-column-with-label "TOTAL VOTES CAST"))
        (party-columns (find-columns-with-label "Party"))
        (votes-columns (find-columns-with-label "Votes")))

    (lambda (*row*)
      (when (plusp (get-column-as #'parse-integer-allowing-junk fips-column))
        (let* ((state (get-column-as #'string-upcase state-column))
               (county (get-column-as #'string-upcase county-column))
               (total (get-column-as #'parse-integer-allowing-junk
                                     total-column))
               (parties (get-columns-as #'string-upcase party-columns))
               (votes (get-columns-as #'parse-integer-allowing-junk
                                      votes-columns)))
          (make-vote-distribution :state state
                                  :county county
                                  :dem (count-votes "DEM" parties votes)
                                  :gop (count-votes "GOP" parties votes)
                                  :votes-cast total))))))

I did that because I originally had all of that functionality in the function which loops over each of the rows in the data set. Now, the function that collects all of these is simpler, but I’m not sure the overall simplicity is much improved.

(defun get-votes-by-county ()
  (loop :with collector := (make-votes-by-county-data-collector)
     :for row :in (data-rows)
     :for dist := (funcall collector row)
     :when dist
     :collect dist))

(defparameter *by-county*
  (stable-sort (get-votes-by-county)
               #'<
               :key #'vote-distribution-votes-cast))

I created a function to bucket them based on their base-10 logarithm. Of course, this immediately freaked out on the couple of counties for which there is no data in the data set, so I had to take care not to take the logarithm of zero.

(defun log-buckets (&optional (key #'vote-distribution-votes-cast))
  (let ((buckets (make-hash-table :test #'equal))
        (max-bucket 0))
    (labels ((bucket-number (dist)
               (floor (log (max (funcall key dist) 1)
                           10)))
             (incorporate (dist)
               (let ((n (bucket-number dist)))
                 (setf max-bucket (max n max-bucket)
                       (gethash n buckets) (1+ (gethash n buckets 0))))))
      (mapc #'incorporate *by-county*)
      (loop :for n :to max-bucket
         :collect (gethash n buckets 0)))))

I made a wrapper function for that which pretty-prints the results as a table.

(defun print-log-buckets (label &optional (key #'vote-distribution-votes-cast))
  (let ((buckets (loop :for pow :from 0
                    :for buck :in (log-buckets key)
                    :appending (list (expt 10 pow) buck))))
    (format t "+~16,,,'-<~>+~20,,,'-<~>+~%")
    (format t "|~16< Votes ~A ~>| Number of counties |~%" label)
    (format t "+~16,,,'-<~>+~20,,,'-<~>+~%")
    (format t "~{| ~12:D's | ~D ~38T|~%~}" buckets)
    (format t "+~16,,,'-<~>+~20,,,'-<~>+~%")))

Here is the source code for all of the above snippets.

Building Supplies January 8th, 2016
Patrick Stein

I have a couple of nascent projects on my plate which require custom, server-side software. I’ve been trying to use these projects to explore the Clean Architecture concepts using Test-Driven Development (TDD).

I can’t even begin.

According to Clean Architecture, I should start with my application logic independent of whether it will be a command-line tool, a series of tools, a web-application, or what-have-you. So, let me start on the application code.

This is TDD though. So, I need to start with a test. How do I do that?

(ql:quickload :nst)

(nst:def-test-group querying-data () ...)

Wait a minute? I just committed myself to Lisp. I just made a huge business decision, a huge implementation decision, a decision that will shape my life for the next n years, and neither TDD or Clean Architecture had anything to say.

I need building supplies. I can’t go anywhere with my architecture or my development without a programming language.

I want to write my apps in Lisp. One of them, I already have mostly done in Lisp. I am still trepidatious about deploying Lisp.

Why am I trepidatious about deploying Lisp? Is it because there is inadequate support for web programming in Lisp? Absolutely not. Is it because I have some doubt in Lisp’s staying power? Absolutely not.

It’s email. Email is holding me back.

I use my current web hosting provider to give me a LAMP stack upon which I run WordPress for this blog and some git repositories and a bug-tracking database (that I can’t remember how to log into). All of that, I could move with confidence in a few hours.

What I dread is having to collect the several hundred mail-forwards that I currently have along with the half-dozen IMAP accounts and move them anywhere, let alone to somewhere that I have to manage them myself and deal with SPAM and mail queues and crap.

It seems that for only a few bucks more per month than I’m paying now, I can get Plesk on a VPS that should be big enough for my purposes. Does anyone have any experience with Plesk? Is it going to make email painless for me on Ubuntu? Or, am I going to hate my life? Does someone have a VPS provider they strongly recommend?

Do I really have to write my application in PHP on the chance that I’ll want to deploy it on the web just because email is scary?

Population Density in Politics January 3rd, 2016
Patrick Stein

I am taking a Coursera course by Wesleyan University titled Data Management and Visualization.

During Bush v. Gore, there was a ton of freely available data about that election. I did a quick-and-dirty graph showing that Gore won the most populous counties by wide margins and Bush won most of the rest of the counties by wide margins.

For this course, I am going to revisit that analysis with the 2012 election data and maybe the 2008 election data.

The Hypothesis

In the United States, there is a strong correlation between population density and voting for the Democratic candidate for president.

The Data

The 2012, by-county results are available through The Guardian newspaper website. The 2008 election data is available for purchase through Dave Leip’s Election Data Store. The US Census Data website has information available about the population and land area of each U.S. county.

Related Work

My first web search for related data was: correlation population density political party.

This search turns up several articles about a scatterplot by Conor Sen relating the Cook Partisan Voting Index (PVI) plotted against population density based on 2012 data. There are related heat-maps by others from the same data.

That search also turns up a paper by Jowei Chen of the University of Michigan and Jonathan Rodden of Stanford University about why compact voting districts are bad for Democrats. That paper focuses mostly on the shapes of voting districts in Florida and how they have be gerrymandered to make those in population-dense areas very compact while those in less populated areas are tentacled and sprawling and how this results in a higher number of Republican representatives than is warranted by overall population numbers.

A related aspect that shows up in this search is that on specific issues, like transit infrastructure, the congressional voting record is strongly correlated with the population density of the congressperson’s district. This effect is a second-order effect, however. The vote of a congressperson will likely be entwined with what the party as a whole wants as much as (or even more than) their constituents want.

ArcGIS contains a map correlating political affiliation of congresspersons with the population density of their districts.

The Tools

I will, of course, being me, use Common Lisp for all of this. I suspect that I will use Fare-CSV for ingesting CSV data. If I have to parse TIGRE data, I will likely rely on some blend of esrap and CL-EWKB or custom geometry code. For plotting, I will likely rely on Vecto but may also try out some of the other libraries like adw-charting or finally get around to making my own multi-backend charting library.

Grid-Generators v0.4.20151016 (and List-Types v0.4.20151029) Released October 29th, 2015
Patrick Stein

Two new Common Lisp libraries released: GRID-GENERATORS and LIST-TYPES.

GRID-GENERATORS and GRID-ITERATE packages

I often find myself wanting to iterate over the points in a rectangular region of a grid in an arbitrary number of dimensions. If my grid is only one-dimensional or two-dimensional then I often just write nested loops to traverse the points of interest.

(loop :for y :from 0 :to 10 :by 1/2
    :do (loop :for x :from 0 :to 10 :by 1/2
             :do (something x y)))

However, when I want the code to be flexible in the number of dimensions, I always end up writing application-specific code to increment arbitrarily long lists of integers with given bounds. I finally got sick of repeating this code every time I needed it and created a library.

For the particular application that I have in mind though, I wanted more than just walking the points inside some rectangular hyper-prism. I wanted to traverse all of the points at a give taxicab distance from a starting point.

The GRID-GENERATORS package facilitates generating points in a rectangular hypergrid, generating points based on taxicab distance, and generating points based upon the number of steps in an arbitrary lattice (given the generators of the fundamental parallelpiped of the lattice). The GRID-ITERATE package provides ITERATE clauses for those generators.

For example, one could iterate over the same points in my nested example above like this:

(loop :with generator := (make-grid-generator '(10 10) :by '(1/2 1/2))
    :for (x y) := (funcall generator)
    :while x
    :do (something x y))

Or, with iterate:

(iterate:iterate
  (iterate:for point on-grid-to '(10 10) by '(1/2 1/2))
  (destructuring-bind (x y) point
     (something x y)))

LIST-TYPES package

The LIST-TYPES package provides a way to generate `SATISFIES` type clauses that ensure that a list contains elements all of the given type. For example, if I wanted to ensure that my list was entirely rational numbers, I could use the declaration:

(check-type my-list (list-types:list-of rational))

Say What You Mean September 19th, 2015
Patrick Stein

There is a scene in the movie The Birdcage where the son tells his father that he (the son) has met a girl and is going to get married. The father begins gulping down the glass of wine that he has in hand. The son asks, Are you upset? The father finishes the glass of wine and says, Let me tell you why.

Here is a function that I wrote several years ago.

(sheeple:defreply mouse-move ((item =draggable=) xx yy)
  (let ((dragging (dragging item)))
    (when dragging
      (let ((dx (- xx (car dragging)))
            (dy (- yy (cdr dragging))))
        (incf (offset-x item) dx)
        (incf (offset-y item) dy))
      (let ((pp (parent item)))
        (when pp
          (when (< (width pp) (+ (offset-x item) (width item)))
            (setf (offset-x item) (- (width pp) (width item))))
          (when (< (height pp) (+ (offset-y item) (height item)))
            (setf (offset-y item) (- (height pp) (height item))))))
      (when (< (offset-x item) 0) (setf (offset-x item) 0))
      (when (< (offset-y item) 0) (setf (offset-y item) 0))
      t)))

This is awful! Am I upset? Let me tell you why.

Is it the Single Responsibility Principle (SRP)? No.

Is it Don’t Repeat Yourself (DRY)? No.

Is it Mixing Levels of Abstraction? Closer, but not quite.

Those are all clearly violated by this code. But, that’s not really the problem. The problem is Why. Nothing about this code tells you why it is here or what is doing.

There is no way to glance at that function and have any idea what’s going on. You have to read it carefully. You have to understand things that aren’t even in this source file to make head nor tail of it. Once you understand the second LET block, you will have nine more lines of code without the least inkling of why there should be nine more lines of code. Anyone care to hazard a guess as to why this function returns T (only) when we’re dragging?

Encapsulation

Two years ago, a colleague and I were tasked with providing docstrings for every function in all of the code we’d written in the last year. We’d done well on providing docstrings to the outward-facing functions, but now we had to do the rest. He started at one end of the directory (in alphabetical order), and I started at the other end. This gave me a good opportunity to look closely at a boat-load of code he’d written that I’d never really delved into before.

He was absolutely religious about encapsulating containers. If he had a hash-table or a p-list or a flat list in a DEFVAR, there was one and only one function that retrieved items from it and at most one function that added items to it. Those functions were one or two lines each (two if they needed a mutex). Those functions were named after what the collection was storing not what mechanism was used to store them.

A lot of times when people talk about the value of encapsulating, they talk about shielding the rest of the code from the implementation details so that if you need to replace how it’s actually implemented on the back end you can do it without breaking any existing code. You are protecting your precious implementation from how people will use it so that you can someday replace the implementation with an even more precious implementation next year (when your language finally gets first-class functions).

I’ve been coding for a good long time now. I’m going to let you in on a little secret. Code almost never gets replaced. When code does get replaced, it almost never continues to adhere to the old API (there was always a semantic leak). If there is a business justification strong enough to let you replace the code, it’s because the old code has become an unmaintainable mess of people subverting the interface or the code as it is didn’t scale and now synchronous things need to happen asynchronously or local things have to happen remotely and hiding that under your old API isn’t going to relieve the bottlenecks.

Trying to insulate your code so that it’s easy to replace is looking down the wrong end of the telescope. The real benefit of encapsulation is that the people who read your code later can be half-asleep and still get everything—your code will scream its meaning. The real benefit of encapsulation is that the person debugging your code can set a break-point in a place that means something—not in the seventeen places the state might have changed but in the only place it could change.

Making It Better

Any ideas what the body in this function does?

(sheeple:defreply mouse-move ((item =draggable=) xx yy)
  (if (being-dragged-p item)
      (handle-event ()
         (let ((dx (- xx (drag-starting-x item)))
               (dy (- yy (drag-starting-y item))))
           (translate-widget item dx dy)
           (keep-widget-inside-parent item)))
      (ignore-event ())))

The new functions BEING-DRAGGED-P, DRAG-STARTING-X, and DRAG-STARTING-Y are just wrappers around what had been explicitly treated as an (OR NULL (CONS INTEGER INTEGER)).

(defun being-dragged-p (item)
  (dragging item))

(defun drag-starting-x (item)
  (car (dragging item)))
(defun drag-starting-y (item)
  (cdr (dragging item)))

It is still an (OR NULL (CONS INTEGER INTEGER)) but nobody ever has to care. Nobody ever has to try to remember what the integers mean. Sure, you could replace it with a structure or a complex number, but why would you ever bother? Why would you ever look at it again?

The new macros HANDLE-EVENT and IGNORE-EVENT encapsulate the return value of this function into something with meaning.

(defmacro handle-event (() &body body)
  `(prog1
       (values t)
     ,@body))

(defmacro ignore-event (() &body body)
  `(prog1
       (values nil)
     ,@body))

It might still be too easy to write an event-handler with a path which doesn’t end in one of these two macros, but it is way better than that dangling T was. It looks like it’s really supposed to be there, and it looks like what it means rather than what it is.

The TRANSLATE-WIDGET and KEEP-WIDGET-INSIDE-PARENT functions can benefit greatly with some further helper functions (and analogous functions for top and bottom):

(defun left (item)
  (offset-x item))
(defun (setf left) (x item)
  (setf (offset-x item) x))
(defun right (item)
  (+ (left item) (width item)))
(defun (setf right) (x item)
  (setf (offset-x item) (- x (width item))))

Some Rules of Thumb

If you find that when you want to check (PRED1 ...) you instead have to check:

(and (PRED0 ...)
     (PRED1 ...))

Then you should consider making a function that does them both. Consider the difference between these two blocks of code:

(when (and (connectedp (player1 g))
           (connectedp (player2 g))
           (not (pausedp g)))
  ...)

(when (game-active-p g)
  ...)

If you find that you are depending on the NULL-ness or positiveness or some other property of some number of state variables to decide which course of action to take, then you should consider making predicates named after your state. In many OO scenarios, you may even want to explicitly track (or calculate) which state you are in at all times.

(defmacro state-case (g &body clauses)
  `(ecase (calculate-or-fetch-state-of g)
     ,@clauses))

(state-case g
 (:pause-screen-showing
  ...)
 (:settings-menu-showing
  ...))

In more imperative languages, it may even be beneficial to keep a STATE member variable in your class. When doing that, make sure that there is one and only one function which actually mutates the value of that STATE member. This will let you:

  1. Log all state transitions without having to hunt for all of them.
  2. Quickly hunt for all of them if you want to do that
  3. Set a break point on all state changes.
  4. Enforce the validity of transitions (or at least scream loudly when something transitions from STOPPED to PAUSED without having passed through PLAYING first).

If you have to check whether some resource is being used by some instance, don’t ask it which resource it is using, ask it whether it is using the one you want.

;;; Common: Reader is forced to know each player has one socket and
;;;    that sockets are comparable with #'=
(loop :for player :in all-networked-players
      :until (= socket-with-something-happening
                (player-socket player))
      :finally (return player))

;;; Better: All I wanted to know is, "Is this yours?"
(loop :for player :in all-networked-players
      :until (player-using-socket-p player
                                    socket-with-something-happening)
      :finally (return player))

Encapsulation is about protecting the person who has to read your code. It’s not about protecting your code.

l