Picking projects and setting up AWS EC2

Filed under: data

Second day! We had some talks about picking data engineering (DE) projects. A project can go breadth first, meaning creative coverage of the open source software available, or depth first, meaning a focus on just one tool (maybe two).


  • How much data is enough? That's not a great way to think about it, because problems vary. A good dataset could be more than a single computer can process. If you need a distributed computing cluster, then that’s a good sign of “big data”.
  • Backpressure builds up when data arrives faster than it can be processed; if the backlog is left to grow, it can take up too much memory and make your system crash.
  • Batch processing is easier to build than realtime processing.
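
To make the backpressure point concrete, here is a minimal sketch in Python. It is my own illustration and not something shown in the talks: a bounded `queue.Queue` refuses new events once it is full, so the backlog cannot grow without limit, and the producer has to hold on to the rejected events and retry later. The names `produce` and `drain` and the buffer size of 3 are made up for the example.

```python
import queue


def produce(buffer, events):
    """Try to enqueue each event; return the ones rejected because the buffer was full."""
    rejected = []
    for event in events:
        try:
            buffer.put_nowait(event)
        except queue.Full:
            # The buffer is signalling backpressure: stop piling up work in memory.
            rejected.append(event)
    return rejected


def drain(buffer, limit):
    """Consume up to `limit` events from the buffer."""
    consumed = []
    while len(consumed) < limit:
        try:
            consumed.append(buffer.get_nowait())
        except queue.Empty:
            break
    return consumed


# Hypothetical tiny buffer so the effect is easy to see.
buffer = queue.Queue(maxsize=3)

rejected = produce(buffer, range(5))        # only the first three events fit
consumed = drain(buffer, 2)                 # the consumer catches up a little
still_rejected = produce(buffer, rejected)  # retrying now succeeds
```

A real pipeline would more likely block or slow the producer down than collect rejects in a list, but the idea is the same: the buffer has a hard limit, and hitting that limit is a signal instead of a crash.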

Project management software

We’re invited to use Trello to manage our projects.

Screenshot from [@jofi](https://gist.github.com/jofi/6029000)

Server setup!

We spent the day creating our EC2 instances and installing Hadoop.

Day #2 - setting up our AWS instances with Hadoop #insightfellows #aws #hadoop #bigdata

Photo posted by Katy Chuang, PhD (@katychuang.nyc)

The weekly schedule for the DE program looks roughly like this:

  • Week 1: Choose project, learn tech
  • Week 2: MVP
  • Week 3: Scaling Project
  • Week 4: Perfecting Project
  • Weeks 5-7: Presenting Project
  • Week 8: Interviews

Imagine, by the end of the 2nd week we would have most of our software installed and data formatted. Heigh-ho off to work we go!


In the afternoon we had a visit from Hilary Mason, who told us about her journey into data science and her work. A few of the things she mentioned:

  • Stats is a world of probabilities, whereas devops is a world of absolutes.
  • Communication skills are great to have; so is empathy.
  • It helps to be able to write deployable code.
  • Data science problems can vary.
  • It’s important to be cognizant of ethical concerns.

It’s great to hear someone speak about inclusivity and focusing on soft skills. Muy importante!

First guest talk, by Hilary Mason #insightfellows #datascience #womenintech

Photo posted by Katy Chuang, PhD (@katychuang.nyc)

Right after that, we DE fellows had a session with Nathan Marz, shared over Skype with the SV fellows. It was great to be introduced to them, though they were faceless voices.

He gave us some suggestions:

  • It's better to define the project well.
  • Queries that count are easy, especially in batch. Unique counting in realtime is hard, especially because you’d have to index the data.
  • Quick and dirty can sometimes be OK: approximate the result first, then reduce the inaccuracy later with a batch process.
  • We should explore a number of techniques to demonstrate our skills to employers.
  • Image analysis and sentiment analysis can be too complex to tackle and are not considered core data engineering.
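
The "approximate now, fix it later in batch" idea for unique counts can be sketched roughly like this. The session did not name a specific algorithm, so this is my own assumption: a k-minimum-values sketch, which keeps only the `k` smallest hash values seen in the stream and estimates the number of distinct items from them. The class name `KMVSketch`, the `user-…` IDs, and `k=256` are all invented for the example.

```python
import hashlib
import heapq


class KMVSketch:
    """Approximate distinct counter that remembers only the k smallest hash values."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []         # negated hashes, so the largest kept hash sits at index 0
        self._members = set()   # hashes currently kept, to skip duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-random float in [0, 1].
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)   # fewer than k distinct items seen: count is exact
        return int((self.k - 1) / -self._heap[0])


# Made-up event stream: 20,000 events from 5,000 distinct users.
events = [f"user-{i % 5000}" for i in range(20000)]

# "Realtime" path: one pass, memory bounded by k.
sketch = KMVSketch(k=256)
for user_id in events:
    sketch.add(user_id)
approx = sketch.estimate()

# "Batch" path: exact, but needs every distinct ID in memory (or in an index).
exact = len(set(events))
```

The realtime number is only an estimate, with an error that shrinks as `k` grows. The batch job can overwrite it with the exact count later, which matches the quick-and-dirty-then-correct approach from the list above.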


Still feeling overwhelmed, with my head spinning from a bunch of new lingo to remember. Is it possible to decide on a project in a matter of days and present it? Only time will tell~