Sorry for the slow reply.

Long story short: entity resolution on emails is a very thorny problem, and there are many techniques we might use to solve it.

If I recall correctly, I wound up building my own entity resolver for performance reasons. Your (awesome) entity resolution requires an n^2 Levenshtein distance matrix, which was prohibitively expensive for some longer lists (and for combinations of lists, which is what I've spent some time looking at).
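
To make the cost concrete, here is a minimal sketch of a pairwise edit-distance matrix. It is not the BigBang code: `levenshtein` and `distance_matrix` are hypothetical names written only for illustration, and the sender strings are assumed to be plain Python strings.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def distance_matrix(senders):
    """Symmetric n x n matrix of pairwise edit distances (hypothetical helper)."""
    n = len(senders)
    matrix = [[0] * n for _ in range(n)]
    # n * (n - 1) / 2 comparisons: this is the quadratic part.
    for i, j in combinations(range(n), 2):
        d = levenshtein(senders[i], senders[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix
```

With 10,000 distinct senders this comes to roughly 50 million string comparisons, each of which is itself quadratic in the string lengths, which is why it gets slow on longer or combined lists.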

I've made a new ticket for consolidating the various entity resolution scripts into a single module that documents their differences.

https://github.com/sbenthall/bigbang/issues/252

I suggest we move any further discussion of this issue to that ticket. I believe there are also some existing tickets on this topic that should be linked there.

- Seb

On Thu, May 5, 2016 at 5:42 PM, Nick Doty <npdoty@ischool.berkeley.edu> wrote:
Hi BigBang dev,

I've been turning back to this project and trying to get the code on my machine up to date with the subsequent changes to BigBang; in particular, the Analyze Senders notebook.

This pull request (using changes from Niels and some fixes of my own) restores the functionality for generating a matrix of similarities, using the new from_header_distance function. The notebook walks through this similarity matrix: visualizing it with a color map, finding a cutoff for similarities, and consolidating senders.

https://github.com/sbenthall/bigbang/pull/242
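
For anyone following along without the notebook, the cutoff-and-consolidate step can be sketched roughly as below. This is not the notebook's code: `consolidate`, its parameters, and the union-find grouping are assumptions for illustration, and it takes a precomputed matrix of distances (smaller means more similar) rather than calling from_header_distance.

```python
def consolidate(names, distances, cutoff):
    """Group names whose pairwise distance is at or below cutoff, transitively.

    names     -- list of sender strings
    distances -- square matrix (list of lists), distances[i][j] for names i, j
    cutoff    -- largest distance still treated as "same sender"
    """
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if distances[i][j] <= cutoff:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())
```

Note that the grouping is transitive: if A is close to B and B is close to C, all three end up consolidated even when A and C are above the cutoff, so the choice of cutoff matters.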

However, I also see that Seb was working on a separate function to do this with some graph functionality, `resolve_sender_entities`. When I ran that function on my test mailing list, though, it didn't seem to consolidate anything. Maybe I'm misunderstanding how this function works, but it would be great to know, especially if it gets more accurate similarity calculations or does them faster.

Thanks,
Nick

_______________________________________________
BigBang-dev mailing list
BigBang-dev@lists.sudoroom.org
https://sudoroom.org/lists/listinfo/bigbang-dev