Population in Politics, Simple Frequency Counts

January 13th, 2016
Patrick Stein

The first coding assignment in the Data Management and Visualization class that I am taking on Coursera is to run a frequency analysis on some of the variables involved in the research you plan to do.

I am using the 2012 U.S. Presidential Election data broken down by county.

Frequency Counts

The assignment was to do frequency counts. Tables of raw frequency counts would be huge: there are 4588 counties in the data set and 4075 different values for the total number of votes cast in a county. So I bucketed the counts by order of magnitude, that is, by the largest power of ten that does not exceed the value. (Counties with zero votes land in the 1's bucket.) Here is the output for the total number of votes cast:

CL-USER> (print-log-buckets "Total" #'vote-distribution-votes-cast)
+----------------+--------------------+
|    Votes Total | Number of counties |
+----------------+--------------------+
|            1's | 3                  |
|           10's | 72                 |
|          100's | 646                |
|        1,000's | 2102               |
|       10,000's | 1513               |
|      100,000's | 247                |
|    1,000,000's | 5                  |
+----------------+--------------------+
NIL

Here is the output for the number of votes for Democratic candidates and for Republican candidates:

CL-USER> (print-log-buckets "Dem" #'vote-distribution-dem)
+----------------+--------------------+
|      Votes Dem | Number of counties |
+----------------+--------------------+
|            1's | 15                 |
|           10's | 166                |
|          100's | 1171               |
|        1,000's | 2380               |
|       10,000's | 730                |
|      100,000's | 124                |
|    1,000,000's | 2                  |
+----------------+--------------------+
NIL

CL-USER> (print-log-buckets "GOP" #'vote-distribution-gop)
+----------------+--------------------+
|      Votes GOP | Number of counties |
+----------------+--------------------+
|            1's | 12                 |
|           10's | 171                |
|          100's | 937                |
|        1,000's | 2349               |
|       10,000's | 1009               |
|      100,000's | 110                |
+----------------+--------------------+
NIL

With the above frequency count, we can see that of the five counties with over a million votes cast, the Democrats got more than a million votes in two of them whilst the Republicans did not get a million votes in any county.

The numbers are pretty close the whole way through, but that still doesn’t say a great deal. It could be that the fifteen counties where Democrats got fewer than ten votes were counties with ten thousand votes cast. So I put together a small function to get the worst counties for a given party, ranked by that party’s share of the votes cast:

(defun get-worst-counties (key &optional (how-many 10))
  (subseq (stable-sort (copy-seq *by-county*)
                       #'<
                       :key (lambda (dist)
                              (/ (funcall key dist)
                                 (max 1
                                      (vote-distribution-votes-cast dist)))))
          0
          how-many))

The worst counties for Democrats and Republicans?

CL-USER> (get-worst-counties #'vote-distribution-dem)
(#S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "HANCOCK CTY TOWNSHIPS" :DEM 0 :GOP 0 :VOTES-CAST 0)
#S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "UPTON" :DEM 0 :GOP 0 :VOTES-CAST 0)
#S(VOTE-DISTRIBUTION :STATE "TX" :COUNTY "KING" :DEM 5 :GOP 139 :VOTES-CAST 145)
#S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "MORO PLT." :DEM 1 :GOP 21 :VOTES-CAST 23)
#S(VOTE-DISTRIBUTION :STATE "MT" :COUNTY "WIBAUX" :DEM 25 :GOP 421 :VOTES-CAST 544)
#S(VOTE-DISTRIBUTION :STATE "TX" :COUNTY "ROBERTS" :DEM 25 :GOP 408 :VOTES-CAST 439)
#S(VOTE-DISTRIBUTION :STATE "ID" :COUNTY "MADISON" :DEM 832 :GOP 13445 :VOTES-CAST 14412)
#S(VOTE-DISTRIBUTION :STATE "ID" :COUNTY "FRANKLIN" :DEM 325 :GOP 5195 :VOTES-CAST 5600)
#S(VOTE-DISTRIBUTION :STATE "TX" :COUNTY "STERLING" :DEM 31 :GOP 459 :VOTES-CAST 494)
#S(VOTE-DISTRIBUTION :STATE "TX" :COUNTY "GLASSCOCK" :DEM 44 :GOP 526 :VOTES-CAST 578))

CL-USER> (get-worst-counties #'vote-distribution-gop)
(#S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "HANCOCK CTY TOWNSHIPS" :DEM 0 :GOP 0 :VOTES-CAST 0)
#S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "UPTON" :DEM 0 :GOP 0 :VOTES-CAST 0)
#S(VOTE-DISTRIBUTION :STATE "SD" :COUNTY "SHANNON" :DEM 2922 :GOP 188 :VOTES-CAST 3130)
#S(VOTE-DISTRIBUTION :STATE "CT" :COUNTY "HARTFORD" :DEM 31735 :GOP 2138 :VOTES-CAST 34037)
#S(VOTE-DISTRIBUTION :STATE "DC" :COUNTY "DISTRICT OF COLUMBIA" :DEM 222332 :GOP 17337 :VOTES-CAST 243348)
#S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "PENOBSCOT NATION VOT DST" :DEM 253 :GOP 23 :VOTES-CAST 281)
#S(VOTE-DISTRIBUTION :STATE "NY" :COUNTY "BRONX" :DEM 288378 :GOP 26304 :VOTES-CAST 316047)
#S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "ISLE AU HAUT" :DEM 48 :GOP 5 :VOTES-CAST 57)
#S(VOTE-DISTRIBUTION :STATE "MA" :COUNTY "PROVINCETOWN" :DEM 2121 :GOP 210 :VOTES-CAST 2380)
#S(VOTE-DISTRIBUTION :STATE "ME" :COUNTY "MONHEGAN PLT." :DEM 49 :GOP 5 :VOTES-CAST 55))

As you can see from this, two counties show no votes cast. In both cases, no precincts are reporting in the data set. The data set gives the number of precincts in each county along with the number of precincts reporting. The counties with no precincts reporting are significant glitches in the data. On the other hand, some counties in the data have hundreds of precincts of which all but one reported. I could remove a county from the data if not all of its precincts reported. However, I believe that, within a county, a single missing precinct will not differ very much from the precincts that were counted. Further, as I have no hope of determining the population density down to the precinct level, I am just going to roll with what I have.
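
Because the two Maine entries with no votes cast sit at the top of both lists, one option is to drop such rows before ranking. The following is only a sketch and not the code that produced the output above: `get-worst-counties-with-votes` and `*sample-counties*` are hypothetical names, and the three sample rows are copied from the output above.

```lisp
;; The struct matches the one defined in the Implementation section.
(defstruct vote-distribution
  (state "" :type string)
  (county "" :type string)
  (dem 0 :type integer)
  (gop 0 :type integer)
  (votes-cast 0 :type integer))

;; Hypothetical sample: three rows copied from the output above.
(defparameter *sample-counties*
  (list (make-vote-distribution :state "ME" :county "UPTON")
        (make-vote-distribution :state "TX" :county "KING"
                                :dem 5 :gop 139 :votes-cast 145)
        (make-vote-distribution :state "SD" :county "SHANNON"
                                :dem 2922 :gop 188 :votes-cast 3130)))

(defun get-worst-counties-with-votes (key counties &optional (how-many 10))
  "Like GET-WORST-COUNTIES, but ignores rows with no votes cast."
  (let* ((with-votes (remove-if-not #'plusp counties
                                    :key #'vote-distribution-votes-cast))
         ;; COPY-SEQ because STABLE-SORT is destructive and the result of
         ;; REMOVE-IF-NOT may share structure with COUNTIES.
         (sorted (stable-sort (copy-seq with-votes) #'<
                              :key (lambda (dist)
                                     (/ (funcall key dist)
                                        (vote-distribution-votes-cast dist))))))
    ;; Clamp HOW-MANY so SUBSEQ does not run past the end of the list.
    (subseq sorted 0 (min how-many (length sorted)))))

(mapcar #'vote-distribution-county
        (get-worst-counties-with-votes #'vote-distribution-dem *sample-counties*))
;; => ("KING" "SHANNON")
```

Whether dropping those rows is the right call depends on the analysis; the sketch only shows that it can be done without touching the rest of the ranking logic.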

Implementation

I put together some simple utilities around Fare-CSV to retrieve particular columns of a CSV file formatted in particular ways. Here is the source code for those utilities.

One thing that immediately became apparent is that there are two separate columns in the data set labelled "TOTAL VOTES CAST". I wanted to make sure there was no confusion, so I wrote a quick function to check that both of those columns agree in every row.

(defun both-total-votes-columns-agree-everywhere ()
  (let ((columns (find-columns-with-label "TOTAL VOTES CAST")))
    (flet ((votes-cast-agrees (*row*)
             (apply #'= (get-columns-as #'parse-integer-allowing-junk
                                        columns))))
      (every #'votes-cast-agrees (data-rows)))))

(assert (both-total-votes-columns-agree-everywhere))

Spoiler: They do. Whew!

The data here has one row per county. I might have preferred one row per county/candidate pair. Regardless, I wrote a short function that takes a party name, the list of party labels from a row, and the list of vote counts from the corresponding columns of that row.

(defun count-votes (party parties votes)
  (loop :for p :in parties
        :for v :in votes
        :when (string= p party)
          :sum v))

For example, this might get arguments party = "DEM", parties = ("DEM" "GOP" "LIB" "GRN" "" "" "" "" "" "" "" "" "" "" "" ""), and votes = (91696 121234 5539 2127 NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL NIL). This function sums up all of the numbers in the votes list where the corresponding entry in the parties list matches the given party.
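
As a quick check of that description, here is a sketch that repeats the function so it runs on its own and calls it with a shortened version of the example arguments. The second call uses made-up numbers to show a party that appears in more than one column.

```lisp
;; COUNT-VOTES as defined above, repeated so this snippet runs on its own.
(defun count-votes (party parties votes)
  (loop :for p :in parties
        :for v :in votes
        :when (string= p party)
          :sum v))

;; The first columns of the example row described above.
(count-votes "DEM"
             '("DEM" "GOP" "LIB" "GRN" "" "")
             '(91696 121234 5539 2127 nil nil))
;; => 91696

;; Made-up numbers: a party listed in two columns is summed across both.
(count-votes "DEM" '("DEM" "GOP" "DEM") '(10 20 5))
;; => 15
```

The NIL entries are harmless here because the empty party label never matches "DEM" or "GOP", so they are never added to the sum.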

I made a little data structure to hold the data that I am interested in for each county.

(defstruct vote-distribution
  (state "" :type string)
  (county "" :type string)
  (dem 0 :type integer)
  (gop 0 :type integer)
  (votes-cast 0 :type integer))

I then created a function which returns a function. The returned function returns an instance of my data structure for the row passed into it. Note: the data set contains rows which roll up the results for a whole state. For those rows, the FIPS code for the county is zero, so the returned function skips them by returning NIL.

(defun make-votes-by-county-data-collector ()
  (let ((state-column (find-column-with-label "State Postal"))
        (county-column (find-column-with-label "County Name"))
        (fips-column (find-column-with-label "FIPS Code"))
        (total-column (find-column-with-label "TOTAL VOTES CAST"))
        (party-columns (find-columns-with-label "Party"))
        (votes-columns (find-columns-with-label "Votes")))

    (lambda (*row*)
      (when (plusp (get-column-as #'parse-integer-allowing-junk fips-column))
        (let* ((state (get-column-as #'string-upcase state-column))
               (county (get-column-as #'string-upcase county-column))
               (total (get-column-as #'parse-integer-allowing-junk
                                     total-column))
               (parties (get-columns-as #'string-upcase party-columns))
               (votes (get-columns-as #'parse-integer-allowing-junk
                                      votes-columns)))
          (make-vote-distribution :state state
                                  :county county
                                  :dem (count-votes "DEM" parties votes)
                                  :gop (count-votes "GOP" parties votes)
                                  :votes-cast total))))))

I did that because I originally had all of that functionality in the function which loops over each of the rows in the data set. Now, the function that collects all of these is simpler, but I’m not sure the overall simplicity is much improved.

(defun get-votes-by-county ()
  (loop :with collector := (make-votes-by-county-data-collector)
        :for row :in (data-rows)
        :for dist := (funcall collector row)
        :when dist
          :collect dist))

(defparameter *by-county*
  (stable-sort (get-votes-by-county)
               #'<
               :key #'vote-distribution-votes-cast))

I created a function to bucket the counties by the base-10 logarithm of a given count. Of course, this immediately freaked out on the couple of counties for which there is no data in the data set, so I had to take care not to take the logarithm of zero.

(defun log-buckets (&optional (key #'vote-distribution-votes-cast))
  (let ((buckets (make-hash-table :test #'equal))
        (max-bucket 0))
    (labels ((bucket-number (dist)
               (floor (log (max (funcall key dist) 1)
                           10)))
             (incorporate (dist)
               (let ((n (bucket-number dist)))
                 (setf max-bucket (max n max-bucket)
                       (gethash n buckets) (1+ (gethash n buckets 0))))))
      (mapc #'incorporate *by-county*)
      (loop :for n :to max-bucket
            :collect (gethash n buckets 0)))))
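
One caveat with taking `floor` of a floating-point logarithm: with integer arguments, `(log 1000 10)` is computed in single-float arithmetic, and on some implementations the result can come out slightly below 3, which `floor` would turn into bucket 2 rather than 3. Nothing in the tables above shows whether any county lands exactly on a power of ten, so this may not affect these results at all. A sketch of an integer-only alternative, where `integer-bucket-number` is a hypothetical name and not part of the original code:

```lisp
(defun integer-bucket-number (count)
  "Return the largest N such that (expt 10 N) <= (max COUNT 1),
using only integer arithmetic."
  (loop :with value := (max count 1)
        :for n :from 0
        :for next-power := 10 :then (* next-power 10)
        :when (< value next-power)
          :return n))

(mapcar #'integer-bucket-number '(0 1 9 10 999 1000 1000000))
;; => (0 0 0 1 2 3 6)
```

Like the `(max ... 1)` guard in `log-buckets`, this puts a count of zero into bucket 0.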

I made a wrapper function for that which pretty-prints the results as a table.

(defun print-log-buckets (label &optional (key #'vote-distribution-votes-cast))
  (let ((buckets (loop :for pow :from 0
                       :for buck :in (log-buckets key)
                       :appending (list (expt 10 pow) buck))))
    (format t "+~16,,,'-<~>+~20,,,'-<~>+~%")
    (format t "|~16< Votes ~A ~>| Number of counties |~%" label)
    (format t "+~16,,,'-<~>+~20,,,'-<~>+~%")
    (format t "~{| ~12:D's | ~D ~38T|~%~}" buckets)
    (format t "+~16,,,'-<~>+~20,,,'-<~>+~%")))
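
The `format` control strings are dense, so here is a small sketch of three of the directives in isolation, using `(format nil ...)` to return strings instead of printing. The last control string is a simplified stand-in for the row format above, not the original.

```lisp
;; ~16,,,'-<~> justifies an empty segment into 16 columns, padding with #\-.
(format nil "+~16,,,'-<~>+")
;; => "+----------------+"

;; ~12:D prints an integer right-aligned in 12 columns with comma separators.
(format nil "~12:D's" 1000000)
;; => "   1,000,000's"

;; ~{ ... ~} walks the flat (power count power count ...) list,
;; consuming two items per iteration.
(format nil "~{~:D's: ~D~%~}" '(1 3 10 72))
```

The third call returns two lines, one per power/count pair, which is why `print-log-buckets` flattens its buckets into a single list with `:appending`.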

Here is the source code for all of the above snippets.
