Sorry for the slow reply.
Long story short: entity resolution on emails is a very thorny problem, and
there are many techniques we might use to solve it.
If I recall correctly, I wound up building my own entity resolver for
performance reasons. Your (awesome) entity resolution requires an n^2
Levenshtein distance matrix, which was prohibitively expensive for some
longer lists (and for combinations of lists, which is what I've spent some
time looking at).
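To make the cost concrete, here is a minimal sketch of what an all-pairs
Levenshtein matrix involves. This is not the BigBang code: the names
`levenshtein` and `distance_matrix` are hypothetical, and the edit distance
is the textbook dynamic program rather than whatever library the notebook
uses. The point is only that n names need n*(n-1)/2 comparisons, each of
which is itself quadratic in the string lengths.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (classic two-row dynamic program)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_matrix(names):
    """Return the symmetric distance matrix and the number of comparisons made."""
    n = len(names)
    matrix = [[0] * n for _ in range(n)]
    comparisons = 0
    for i, j in combinations(range(n), 2):
        d = levenshtein(names[i], names[j])
        matrix[i][j] = matrix[j][i] = d
        comparisons += 1
    return matrix, comparisons
```

The comparison count grows as n*(n-1)/2, so combining two lists of similar
size roughly quadruples the work.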
I've made a new ticket for consolidating the various entity resolution
scripts into a single module that documents their differences.
https://github.com/sbenthall/bigbang/issues/252
I suggest we move any further discussion of this issue to that ticket.
I believe there are also some existing tickets on this that should be
linked there.
- Seb
On Thu, May 5, 2016 at 5:42 PM, Nick Doty <npdoty(a)ischool.berkeley.edu>
wrote:
Hi BigBang dev,
I've been turning back to this project and trying to get the code on my
machine up to date with the subsequent changes to BigBang; in particular,
the Analyze Senders notebook.
This pull request (using changes from Niels and some fixes of my own)
returns functionality for generating a matrix of similarities, using the
new from_header_distance function. The notebook walks through this
similarity matrix: visualizing it with a color map, finding a cutoff for
similarities, and consolidating senders.
https://github.com/sbenthall/bigbang/pull/242
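As a rough illustration of the cutoff-and-consolidate step (a sketch under
assumptions, not the notebook's code): the function `consolidate` below is
hypothetical, it takes a precomputed matrix of distances where smaller
means more similar, and it assumes merging is transitive, so if A matches B
and B matches C, all three end up as one sender.

```python
def consolidate(names, distances, cutoff):
    """Group names whose pairwise distance is at or below the cutoff.

    Uses a small union-find so that matches chain together transitively
    (an assumption of this sketch).
    """
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if distances[i][j] <= cutoff:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Attach the higher-numbered root to the lower one.
                    parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())
```

Because merging is transitive, a cutoff that is too loose can chain
unrelated senders together through intermediate matches.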
However, I also see that Seb was working on a separate function to do this
with some graph functionality, in `resolve_sender_entities`. When I ran
that function on my test mailing list, though, it didn't seem to
consolidate anything. Maybe I'm misunderstanding how this function works,
but it would be great to know, especially if it gets more accurate
similarity calculations or does them faster.
Thanks,
Nick
_______________________________________________
BigBang-dev mailing list
BigBang-dev(a)lists.sudoroom.org
https://sudoroom.org/lists/listinfo/bigbang-dev