Diagnostic tools – or – the pretty visualization is not the end

As the semester and my first graduate digital history class wind down, I’ve been thinking a lot about building DH things for investigation vs. argument.  There’s a lot of good work on tools-as-theory, and whether a digital thing can be a satisfying argument, and an upcoming conference on argumentation in the digital humanities – so I’m not the only one.

I also just finished writing 1-2 pages – maybe 1,000 words – based on a diagnostic tool that it took me over a month to build.  I’m hoping to spin what it tells me out into a longer article in future, but for now I thought I’d share it here, with some commentary on how I made it, what it told me, and why it is not an effective argument.

One of my book chapters is on a group of enslaved and free people in Richmond who raised funds for victims of famine in Ireland.  The First African Baptist Church of Richmond raised just under $35 in 1847. While the amount per congregant was low (the church listed thousands of active members, but many of them were not able to regularly attend because of their enslavement) the donation itself was relatively unique in the church’s history.  This was one of the first times that this congregation raised funds for people not connected with the church.  I have a much longer argument on the political work that this donation did, but I wanted to be able to make some concrete statements about congregants’ experiences in the 1840s.

This was helped by the church minute books, which recorded the names of baptized, excluded and restored members (there were a lot of exclusions for adultery in the 1840s) as well as the names of the men and women who owned the congregants who were enslaved.  So I built a network (using Gephi, which benefits tremendously from the recent update) that showed only relationships characterized by slavery, to see if any white Richmonders were particularly over-represented. (made with sigma.js and the Gephi plugin created by OII)

While some men and women owned more than one congregant, by and large this network was fairly diffuse.  Congregants obviously shared the religious and physical space of the church, but their relationships outside of the church did not seem to be conditioned by their enslavement by particular men and women. (There is an excellent and robust literature on enslaved people in urban spaces, resistance and community building, which I won’t recap here – but suffice it to say that scholars have charted many other ways of relating beyond ownership by the same person, and I assume those modes were at play in 1840s Richmond).

As I put together the database of congregants, I realized that many unusual names (Chamberlayne, Poindexter, Frayzer, Polland, among others) recurred among both slaveholding and enslaved people.  So I made another network, this one assuming that people who shared a surname had some kind of relationship (this is not a 100% defensible assumption – some of the more common names might have been happenstance).  With those kinds of connections, the network (which includes all of the same people as above) becomes much more dense, with clusters that signify relationships based both in slavery and (most often coerced) sex.
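For anyone curious about the mechanics, the shared-surname assumption can be sketched in a few lines of Python. This is only an illustration, not the code behind the network above (that was built with Gephi): the function name `surname_edges`, the `person_N` ids and the sample records are all invented, and only the surnames come from the minute books.

```python
from collections import defaultdict
from itertools import combinations


def surname_edges(people):
    """Return sorted (id_a, id_b, surname) edges for people sharing a surname.

    people: iterable of (person_id, surname) tuples. Surnames are compared
    after trimming whitespace and lowercasing, which is itself an assumption
    about how the transcription was cleaned.
    """
    by_surname = defaultdict(list)
    for person_id, surname in people:
        by_surname[surname.strip().lower()].append(person_id)

    edges = []
    for surname, ids in by_surname.items():
        # Every pair of people under one surname gets one undirected edge.
        for a, b in combinations(sorted(ids), 2):
            edges.append((a, b, surname))
    return sorted(edges)


# Invented records, for illustration only.
people = [
    ("person_1", "Poindexter"),
    ("person_2", "Poindexter"),
    ("person_3", "Chamberlayne"),
    ("person_4", "poindexter "),
]

for edge in surname_edges(people):
    print(edge)
```

Because every pair under a surname is linked, a common name produces a dense cluster very quickly, which is one reason the second network overstates how certain these relationships are.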

It’s interactive!  It’s dynamic!  It’s a network!

It is not an argument.

At best, this is a tool that lets me locate an individual and see connections.  It relies on two kinds of relationships (and likely overstates the certainty of genetic relationships or previous ownership based on shared surnames).  It helped me to write two pages about the density of connections among black and white Richmonders, and bolster claims about the broader relationships that the First African Baptist Church was embedded in.  It remains an investigative tool.

I think it could be helpful, which is why I am putting it on the internet, but it does not constitute argument.  It does not even constitute analysis (that happened behind the scenes in R).  It did take – from the start of transcription to now – over a month to build.

Was it worth it?  Well, I was able to see connections among the 800+ congregants mentioned in the minute books from 1845-1847 that I would not have been able to see just by reading the names.  I was able to place individuals in a broader social context.  I wrote two pages.  I think that work like this can be tremendously generative, but it either happens behind the scenes and only lives on a researcher’s computer, or is presented as the end of an investigative process. This is firmly in the middle of the investigation, but I suppose that has value too.

d3.js + R > Gephi (or, why network analysis helps with history)

Gephi is a very useful tool.  I’m very much looking forward to the new release that seems always on the horizon.  In the meantime, though, every time I open Gephi it crashes, and then I dive down a long rabbit hole of trying to re-write the program code, and then I get angry and go home.  So I’ve been delighted to find that a combination of R (for manipulating and analyzing the data) and d3.js (for visualizing the data) does most of the work of Gephi with much less frustration.

I’ve been using Kieran Healy’s work on Paul Revere and network centrality and applying it to a cohort of men who served on the boards of philanthropic organizations in New York in the 1840s. I am particularly interested in the officers of the General Relief Committee for the Relief of Irish Distress of the City of New York. These men – Myndert Van Schiack, John Jay, Jacob Harvey, George Griffin, Theodore Sedgewick, Robert B. Minturn, George Barclay, Alfred Pell, James Reyburn, William Redmond and George McBride Jr. – were deeply politically connected, but don’t seem to have had much of a relationship to one another.

Healy’s script and Mike Bostock’s d3 blocks helped me to build two matrices.  The first tracks relationships between philanthropists via organizations, noting the number of organizational connections that each pair of men shared.  The second tracks relationships between philanthropic organizations and social clubs via philanthropists, noting the number of men that each pair of organizations shared.  I used the former to build a force-directed network diagram, which, in combination with some R-based analysis, suggests that while the New York Famine Relief Committee officers didn’t often serve on other committees together, they shared other social connections.
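The two matrices are both projections of a single membership table, and the idea can be sketched without any special libraries. What follows is a rough Python illustration, not Healy’s R script and not my own code: the function name `shared_counts` and the membership data are hypothetical placeholders.

```python
from collections import Counter
from itertools import combinations


def shared_counts(memberships):
    """Project a person -> organizations table in both directions.

    memberships: dict mapping each person to the set of organizations
    he belonged to. Returns (person_pairs, org_pairs), where each Counter
    maps a sorted pair to the number of organizations (or people) shared.
    """
    # Person-by-person: how many organizations does each pair share?
    person_pairs = Counter()
    for a, b in combinations(sorted(memberships), 2):
        shared = len(memberships[a] & memberships[b])
        if shared:
            person_pairs[(a, b)] = shared

    # Invert the table to get each organization's roster of people.
    rosters = {}
    for person, orgs in memberships.items():
        for org in orgs:
            rosters.setdefault(org, set()).add(person)

    # Organization-by-organization: how many people does each pair share?
    org_pairs = Counter()
    for x, y in combinations(sorted(rosters), 2):
        shared = len(rosters[x] & rosters[y])
        if shared:
            org_pairs[(x, y)] = shared

    return person_pairs, org_pairs


# Placeholder data, not drawn from the archival record.
memberships = {
    "A": {"Relief Committee", "Social Club"},
    "B": {"Relief Committee", "Social Club"},
    "C": {"Social Club"},
}

person_pairs, org_pairs = shared_counts(memberships)
print(dict(person_pairs))
print(dict(org_pairs))
```

The pair counts are what a force-directed diagram would use as edge weights, so that men who shared several boards are pulled closer together than men who shared only one.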

For example, Jonathan Goodhue was not a member of the famine relief committee, but served on other committees with nearly every General Relief Committee officer.  Of the New York famine relief committee members, Jacob Harvey was the most centrally connected member.  This data has pointed me in some new archival directions, but also gives a much better sense of the ways in which people were connected to one another than comparable textual descriptions might do.
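“Most centrally connected” can mean several things, and I haven’t said here which measure I have in mind. As a minimal sketch, assuming the simplest one (weighted degree, or the total number of shared memberships a person has with everyone else), the ranking could be computed like this in Python. The function name and the pair counts are invented for illustration.

```python
from collections import defaultdict


def weighted_degree(pair_counts):
    """Sum, for each person, the shared-membership counts across all pairs.

    pair_counts: dict mapping (person_a, person_b) to the number of
    organizations the two shared.
    """
    totals = defaultdict(int)
    for (a, b), shared in pair_counts.items():
        totals[a] += shared
        totals[b] += shared
    return dict(totals)


# Placeholder pair counts, not drawn from the archival record.
pair_counts = {("A", "B"): 2, ("A", "C"): 1, ("B", "C"): 1, ("A", "D"): 3}

scores = weighted_degree(pair_counts)
# Rank from most to least connected, breaking ties alphabetically.
ranked = sorted(scores, key=lambda name: (-scores[name], name))
print(ranked)
```

Other measures, such as betweenness or eigenvector centrality, reward different things (brokering between groups, or being connected to the well connected), so the same data can produce a different ranking.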



I also built a network diagram showing relationships among different newspapers reporting on the famine, which clusters newspapers more inclined to cite each other.


Famine data

One of the secondary questions of my research has been what themes in famine reporting were dominant among all famine reports in different locales.  What, for instance, was the most common framework for famine reporting in New York in 1847, and how did that differ from the frameworks employed in Britain, the American South or Indian Territory?  I’ve tried a few really clunky ways of representing this, by tracking the number of iterations of certain themes by place and time.  (I should say that these are themes I’ve assigned myself – they differ somewhat from place to place, with major overlaps – and include references to the availability of potatoes (coded as “potato”), appeals for aid (coded as “appeals”) and discussions of American obligation (coded as “American sympathy”).  As a result, these themes are somewhat subjective – the next step in this visualization is to mine the text of all of the reports I’ve collected, but that’s for another day.)

Anyway, as part of this IVMOOC I’m taking while biding time before my defense/trying to grapple with data in a more systematic way, I learned about “burst analysis.”  Basically, this is a way of tracking increased incidences of certain words in articles/titles/subject headings/whatever over time.  Jon Kleinberg, who developed this kind of analysis, describes it as a way of tracking “the appearance of a topic in a document stream [a]s signaled by a ‘burst of activity,’ with certain features rising sharply in frequency as the topic emerges.”  In other words, a topic “bursts” when it is discussed with greater and greater frequency (as determined by a set of key words) and the burst ends when that frequency dips.  There’s a lot of math involved in figuring out the “burstiness” of any given theme, but the fabulous Sci2 tool thankfully does all that for me.
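To give a sense of what that math is doing, here is a rough Python sketch of a two-state version of the idea, in the spirit of Kleinberg’s batched model. It is not what Sci2 runs internally, as far as I know. The function name `burst_states`, the parameters `s` and `gamma`, and the monthly counts are assumptions for illustration. Each time slice is either “normal” (state 0) or “bursting” (state 1); the bursting state expects the theme `s` times as often, and entering it carries a cost so that a single noisy month doesn’t count as a burst.

```python
import math


def burst_states(relevant, totals, s=2.0, gamma=1.0):
    """Label each time slice 0 (baseline) or 1 (burst).

    relevant[t]: reports in slice t that mention the theme.
    totals[t]:   all reports in slice t.
    Assumes the theme appears at least once, but not in every report.
    """
    n = len(relevant)
    p0 = sum(relevant) / sum(totals)   # baseline rate of the theme
    p1 = min(s * p0, 0.9999)           # elevated rate during a burst
    probs = (p0, p1)

    def fit_cost(state, r, d):
        # Negative log-likelihood of r hits in d reports (binomial).
        p = probs[state]
        return -(math.log(math.comb(d, r))
                 + r * math.log(p)
                 + (d - r) * math.log(1 - p))

    up_cost = gamma * math.log(n)      # price of entering the burst state

    # Viterbi search for the cheapest sequence of states.
    cost = [0.0, math.inf]             # start in the baseline state
    back = []
    for r, d in zip(relevant, totals):
        new_cost, pointers = [], []
        for state in (0, 1):
            options = [
                cost[prev] + (up_cost if prev == 0 and state == 1 else 0.0)
                for prev in (0, 1)
            ]
            best_prev = 0 if options[0] <= options[1] else 1
            new_cost.append(options[best_prev] + fit_cost(state, r, d))
            pointers.append(best_prev)
        cost = new_cost
        back.append(pointers)

    # Walk backwards from the cheapest final state.
    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for pointers in reversed(back):
        states.append(state)
        state = pointers[state]
    states.reverse()
    return states


# Invented monthly counts: ten reports a month, theme mentions per month.
mentions = [1, 1, 1, 8, 9, 8, 1, 1]
reports = [10] * 8
print(burst_states(mentions, reports))
```

Raising `gamma` makes bursts harder to enter, and raising `s` demands a sharper jump in frequency before a slice counts as bursting, which may be why a slow, sustained discussion never registers as a “burst.”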

So, here’s my first attempt to map “bursts” in famine reporting themes:

I think there are a few interesting things about this visualization, which I’ve intuited but never really seen so clearly.  The first is that the major themes I’ve highlighted in my dissertation “burst” at very different times.  I suspect that this has to do with the speed at which news traveled in the mid-nineteenth century, but the fact that the newspapers of the urban South contained an uptick in discussions about immigration in 1849 is interesting as well.  I also love the little blip of interest in nationalism in New York in the middle of 1847 – there’s a much more extended discussion of the problems facing the Irish nation in 1848, but perhaps later references to nationalism didn’t occur rapidly enough to constitute a “burst.”

“Are you a math person? You look like a math person.”

Having submitted my dissertation for review, I find myself with some time on my hands.  While many people have suggested that this would be an opportune moment to relax, my father, who is also an academic, suggested that it merely freed up time to begin new projects! Write articles! Learn new skills!  Having taken one morning off this week to drink cocoa and read a novel, I think I’m all done relaxing and ready to get started.

A few years ago, after a thrilling session on network analysis at the AHA, I decided that I was going to teach myself network analysis.  That, much like my undergraduate attempts at linear regression analysis in stat classes populated by econ majors, didn’t go quite as planned, and I mostly gave up and began to rely on IBM’s online ManyEyes software, which produces nice, if slightly clunky, visual representations of data.  But just yesterday, I received notice of Indiana University’s free MOOC on information visualization (referred to as IVMOOC, which is really quite fun to say), which is offered just when I need something to occupy my time/keep me from compulsively re-editing a document I’ve already turned in.  The preliminary survey for the course suggests that it’s mostly geared towards people who already have data-driven backgrounds, so for the next eight weeks, I expect to feel much like I did when confronted with Chi-squared problems in my senior year of college – completely over my head, but having loads of fun.

At the same time, I also hope to get acquainted with the open source Quantum GIS software, which seems like it would be a pretty nifty way to deal with the map-making problems I’ve been confronting recently.

Also revising one article.  Also writing another article about the movement of information in the mid-nineteenth century, which hopefully utilizes some of what I’ve picked up from IVMOOC and Quantum GIS.

At any rate, the enthusiasm made possible by my new-found time must have been obvious to the woman sitting next to me during my novel-reading/cocoa-drinking morning off.  As she got up from her seat next to me at the cafe, she turned and said “Are you a math person?  You look like a math person.”  We’ll see.

The more [history] you learn, the more [history] you see


Credit: Bill Amend at http://www.foxtrot.com/

I’ve been throwing out variations on this line since I first saw this strip, and I’ve been having quite a few “the more history you learn…” moments in the past few weeks because of the hurricane.

On Saturday, the Press of Atlantic City reported that NOAA classified Sandy as a post-tropical cyclone right before it made landfall in NJ, a decision which is estimated to save homeowners/cost insurance companies millions of dollars in deductibles.  NOAA isn’t a political body, but the classification is a fortuitous one for those facing insurance claims for their destroyed property, and it was echoed by NJ Governor Chris Christie when he issued an executive order prohibiting insurance companies from charging hurricane deductibles.  (For a really fascinating discussion of the relationship between disasters and flood insurance, see parts II and III of Ted Steinberg’s Acts of God.)  Though most of the article was about the impact of this call on insurance claims, the article briefly digresses into talking about what it means for a scientific body to be in charge – however indirectly – of a huge financial decision:

“If this was a court case, you’d have multiple meteorologists on the stand,” said Campbell H. Wallace, an attorney for the Professional Insurance Agents of New Jersey.

There is no court case. Insurance companies in New Jersey, New York and Connecticut have agreed to waive costly hurricane deductibles, which could have run in the millions of dollars along the three-state area.

Wallace said the insurance industry accepts the fact that the National Weather Service is “legally tasked” with making such determinations. He said meteorologists are judged by their peers and credibility is paramount to them.

The Wallace quote reminds me of another apparently ancillary fact, this one about an earlier Atlantic hurricane: the Galveston Hurricane of 1900, which killed upwards of eight thousand people.  Although meteorologists, both in the U.S. and in Cuba, registered concerns about a storm headed for the Gulf of Mexico, the National Weather Bureau’s policy was to limit the use of the word “hurricane” in official correspondence, because it might engender widespread panic.  On top of all of the other reasons for the high Galvestonian death toll (the misguided belief that hurricanes never struck that part of the Gulf, little way for ships to communicate observations from the middle of a storm, buildings that were particularly susceptible to storm damage) some of the blame must go, and has gone, to whoever made the decision that “hurricane” was just too dangerous a word for the American people.

In some ways, what is happening with insurance companies today is the flipside of what happened with the NWB and Galveston – in defining what counts as a hurricane, and what is “merely” a post-tropical cyclone (the two can be differentiated by as little as a 1 mph difference in maximum wind speeds measured on the ground), NOAA is saving – intentionally or no – thousands of people millions of dollars in total.