[Bigbang-dev] Analyze Senders and consolidating email senders

Sebastian Benthall sbenthall at gmail.com
Wed May 25 14:07:33 PDT 2016


Sorry for the slow reply.

Long story short: entity resolution on emails is a very thorny problem, and
there are many techniques we might use to solve it.

If I recall correctly, I wound up building my own entity resolver for
performance reasons. Your (awesome) entity resolution requires an n^2
Levenshtein distance matrix, which was prohibitively expensive for some
longer lists (and for combinations of lists, which is what I've spent some
time looking at).
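
To make the cost concrete, here is a minimal sketch of the pairwise
approach. It is not the code from either BigBang implementation: the
function names (levenshtein, distance_matrix, consolidate), the sample
From headers, and the cutoff are all made up for illustration. The point
is only that the full matrix needs n * (n - 1) / 2 distance computations
before any consolidation can happen.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + (ca != cb),   # substitution (free on match)
            ))
        previous = current
    return previous[-1]


def distance_matrix(senders):
    """Symmetric pairwise matrix: n * (n - 1) / 2 Levenshtein computations."""
    n = len(senders)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = levenshtein(senders[i], senders[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


def consolidate(senders, cutoff):
    """Group senders linked by any distance <= cutoff (union-find)."""
    matrix = distance_matrix(senders)
    parent = list(range(len(senders)))

    def find(i):
        # Follow parent links to the group's root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(senders)), 2):
        if matrix[i][j] <= cutoff:
            parent[find(j)] = find(i)

    groups = {}
    for i, sender in enumerate(senders):
        groups.setdefault(find(i), []).append(sender)
    return list(groups.values())


# Hypothetical From headers, for illustration only.
senders = [
    "Jane Doe <jane@example.org>",
    "Jane Doe <jane@example.com>",
    "Bob Roe <bob@example.net>",
]
groups = consolidate(senders, cutoff=3)
```

With a few hundred senders this is fine, but the number of pairs grows
quadratically, so a list (or union of lists) with tens of thousands of
distinct From headers means hundreds of millions of edit-distance calls.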

I've made a new ticket for consolidating the various entity resolution
scripts into a single module that documents their differences.

https://github.com/sbenthall/bigbang/issues/252

I suggest we move any further discussion of this issue to that ticket.
I believe there are also some existing tickets on this that should be
linked there.

- Seb

On Thu, May 5, 2016 at 5:42 PM, Nick Doty <npdoty at ischool.berkeley.edu>
wrote:

> Hi BigBang dev,
>
> I've been turning back to this project and trying to get the code on my
> machine up to date with the subsequent changes to BigBang; in particular,
> the Analyze Senders notebook.
>
> This pull request (using changes from Niels and some fixes of my own)
> restores the functionality for generating a matrix of similarities, using
> the new from_header_distance function. The notebook walks through this
> similarity matrix: visualizing it with a color map, finding a cutoff for
> similarities, and consolidating senders.
>
> https://github.com/sbenthall/bigbang/pull/242
>
> However, I also see that Seb was working on a separate function to do this
> with some graph functionality, in `resolve_sender_entities`. When I ran
> that function on my test mailing list, it didn't seem to consolidate
> anything. Maybe I'm misunderstanding how this function works, but it would
> be great to know, especially if it gets more accurate similarity
> calculations or does them faster.
>
> Thanks,
> Nick
>
> _______________________________________________
> BigBang-dev mailing list
> BigBang-dev at lists.sudoroom.org
> https://sudoroom.org/lists/listinfo/bigbang-dev
>
>