I agree that it doesn't seem like a good practice to include the data itself in the research tool repository. Maybe we could just include a version of the script that would download all the data needed for that group of notebooks? So for people getting started with that set of notebooks, the instructions would be to clone the repository, run setup as necessary, and then run a script that downloads the relevant mail.
Many of the notebooks now include function calls with URLs for gathering the appropriate mail archive, and if the archive has already been downloaded, the download isn't repeated.
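That caching behavior could be sketched roughly like this. This is just an illustration of the idea, not BigBang's actual collection code; the function name and layout here are hypothetical:

```python
import os
import urllib.request

def collect_archive(url, archive_dir="archives"):
    """Download a mailing list archive unless a local copy already exists.

    Hypothetical helper illustrating the caching behavior described above;
    BigBang's own collection functions differ in their details.
    """
    os.makedirs(archive_dir, exist_ok=True)
    local_path = os.path.join(archive_dir, os.path.basename(url))
    if os.path.exists(local_path):
        # Archive already downloaded; skip the network round trip.
        return local_path
    urllib.request.urlretrieve(url, local_path)
    return local_path
```

A notebook would then call this with the list's archive URL at the top, and re-running the notebook wouldn't re-fetch anything.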
Generally it's good not to have all the data one works with checked into version control.
Actually, no data is currently checked into version control. When you install BigBang, you have to run the collect_mail scripts before getting anything out of the notebooks.
If there's a project that uses BigBang for extensive analysis of data from a single source, then it's probably best to keep that as a fork and have it update from the core repository.
What I'm wondering now is whether all, some, or none of the Summer School notebooks should make it in as-is. Currently there are many near-duplicate notebooks in the examples/ directory, along with a lot of other material from previous uses of the software.
Some hard work that's going to need to happen soon is pruning and standardizing the stuff in that directory. Along the way we should come up with code quality guidelines and standards for new notebooks.