From niels at article19.org  Wed Mar 14 14:36:41 2018
From: niels at article19.org (Niels ten Oever)
Date: Wed, 14 Mar 2018 22:36:41 +0100
Subject: [Bigbang-dev] IETF archive
In-Reply-To: <43de0295-4530-851f-bb30-8f70858cb09a@article19.org>
References: <43de0295-4530-851f-bb30-8f70858cb09a@article19.org>
Message-ID:

Hi all,

I hope this email finds you all very well!

In preparation for the IETF hackathon I have spent three days downloading
~33 GB of IETF mailing list archives from ftp.ietf.org/ietf-mail-archive/

I will bring it on a disk to the hackathon in London, but I am also going to
make it available as a zip file from a server, which should allow for a much
quicker download. Will share the IP address here soon. I will also put an
archive of all ICANN mailing lists there.

I am still looking for a way to create CSVs from the archives so that they
can be used directly in bigbang (see the discussion below with Sebastian).

All input appreciated!

Best,

Niels

-------- Forwarded Message --------
Subject: Re: quick q
Date: Wed, 14 Mar 2018 22:03:36 +0100
From: Niels ten Oever
To: Sebastian Benthall

Hi Sebastian,

Not sure if I am correctly doing what you're suggesting, but:

On 03/14/2018 08:20 PM, Sebastian Benthall wrote:
> You may have trouble getting all 33 gigs into memory at the same time.
> I've never tried that.
>
> Have you tried creating an Archive object for just one group, as is
> illustrated in the example notebooks?
>

If I use for instance:

$ python2 bin/collect_mail.py -u https://www.ietf.org/mail-archive/text/ietf/

I get endless chardet errors and it ends in:

DEBUG:chardet.charsetprober:windows-1255 Hebrew confidence = 0.0
tzinfo.utcoffset() returned 1440; must be in -1439 .. 1439
'ascii' codec can't encode character u'\xe4' in position 1084: ordinal not in range(128)
Can't export data. Aborting.

So this was not a workable way to get all the mailing lists for the
hackathon, which is why I used wget to get them.
So now I am looking for a way to make them easily usable for the
participants, but I am not sure how to do this. I am not sure which example
notebook you meant I can do this with; all the ones I looked through actually
need a CSV, or try to download the list themselves.

Cheers,

Niels

> I believe that when it creates one from raw email it will generate a
> .CSV file of the same data for you.
>
> On Mar 14, 2018 1:41 PM, "Niels ten Oever"
> wrote:
>
>     In other words, over the past three days I downloaded all these:
>
>     https://www.ietf.org/mail-archive/text/
>
>     And now I would like to import them into BigBang, but I am not sure
>     what command to use.
>
>     When I try to use the notebooks they are asking for CSVs.
>
>     Cheers,
>
>     Niels
>
>     Niels ten Oever
>
>     Article 19
>     www.article19.org
>
>     PGP fingerprint    2458 0B70 5C4A FD8A 9488
>                        643A 0ED8 3F3A 468A C8B3
>
>     On 03/14/2018 06:31 PM, Niels ten Oever wrote:
>     > Hiya Seb,
>     >
>     > All good? I have a quick question. Do you know how I can import
>     > email lists that I have already downloaded? In other words, how do I
>     > create CSVs of the 33 GB of mailing lists I just harvested :)
>     >
>     > Hope all is well! I think I will be churning on this stuff tonight,
>     > so maybe expect some mails later ;) xx
>     >
>     > ~n.,
>     >
>     >
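
On the open question in this thread (turning already-downloaded archives into
CSVs without loading all 33 GB at once), here is a minimal sketch using only
the Python 3 standard library. It is not BigBang's own importer. It assumes
the files fetched with wget are in mbox format, and the "*.mail" file pattern,
the column names in FIELDS, and the function names are hypothetical choices
for illustration; whatever columns BigBang's notebooks expect would need to be
matched separately. Each file is converted on its own, undecodable bytes are
replaced rather than raising, and dates that cannot be parsed (for example an
out-of-range UTC offset, like the tzinfo error above) are left empty.

```python
import csv
import mailbox
from email.header import decode_header, make_header
from email.utils import parsedate_to_datetime
from pathlib import Path

# Hypothetical column set; adjust to whatever the consuming tool expects.
FIELDS = ["message_id", "from", "subject", "date", "in_reply_to", "body"]


def header_text(value):
    """Decode an RFC 2047 header to str, falling back to the raw value."""
    if value is None:
        return ""
    try:
        return str(make_header(decode_header(str(value))))
    except Exception:
        return str(value)


def iso_date(value):
    """Return the Date header as ISO 8601, or '' if it cannot be parsed."""
    if value is None:
        return ""
    try:
        return parsedate_to_datetime(str(value)).isoformat()
    except (TypeError, ValueError, OverflowError):
        return ""


def body_text(msg):
    """Return the first text/plain part, replacing undecodable bytes."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            try:
                return payload.decode(charset, errors="replace")
            except LookupError:  # charset name unknown to Python
                return payload.decode("utf-8", errors="replace")
    return ""


def mbox_to_csv(mbox_path, csv_path):
    """Write one CSV row per message in mbox_path; return the row count."""
    box = mailbox.mbox(str(mbox_path), create=False)
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for msg in box:
            writer.writerow({
                "message_id": header_text(msg["Message-ID"]),
                "from": header_text(msg["From"]),
                "subject": header_text(msg["Subject"]),
                "date": iso_date(msg["Date"]),
                "in_reply_to": header_text(msg["In-Reply-To"]),
                "body": body_text(msg),
            })
            count += 1
    box.close()
    return count


def convert_tree(src_dir, out_dir, pattern="*.mail"):
    """Convert every file matching pattern under src_dir, one CSV per file."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    totals = {}
    for path in sorted(src_dir.rglob(pattern)):
        rel = path.relative_to(src_dir)
        target = out_dir / ("_".join(rel.parts) + ".csv")
        totals[str(rel)] = mbox_to_csv(path, target)
    return totals
```

Calling convert_tree("mail-archive", "csv-out") on the directory created by
wget (both directory names are placeholders) would write one CSV per archive
file and return a mapping from file to message count, so memory use stays
bounded by the largest single file rather than by the whole collection.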