Picking projects and setting up AWS EC2

Filed under: data
September 8, 2015

Second day! We had some talks about picking data engineering (DE) projects. A project can go breadth first, meaning a creative coverage of many of the open source tools available, or depth first, focusing on just one (maybe two).

Advice

How much data is enough? That’s not a great way to think about it, because problems vary. A good dataset could be one that’s bigger than what a single computer can process. If you need a distributed computing cluster, that’s a good sign of “big data”.
Backpressure can happen when a backlog takes up too much memory and makes your system crash.
Batch processing is easier to build than realtime processing.
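
To make the backpressure note a bit more concrete for myself, here is a minimal sketch, assuming a single producer and a slower consumer. The idea is to cap the backlog with a bounded buffer, so the producer gets told "not now" instead of the backlog growing until memory runs out. The names `offer` and `simulate`, and the numbers in the usage lines, are made up for illustration and don't come from the talks.

```python
import queue


def offer(buffer, item):
    """Try to enqueue without blocking; return False when the buffer is full."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        return False


def simulate(events, capacity, drain_every):
    """Push events into a bounded buffer while a slow consumer
    removes one item after every `drain_every` incoming events."""
    buffer = queue.Queue(maxsize=capacity)  # hard cap on the backlog
    accepted = rejected = peak = 0
    for i, event in enumerate(events, start=1):
        if offer(buffer, event):
            accepted += 1
        else:
            rejected += 1  # the producer is signalled to slow down or retry
        peak = max(peak, buffer.qsize())
        if i % drain_every == 0 and not buffer.empty():
            buffer.get_nowait()  # the slow consumer handles one item
    return accepted, rejected, peak


# Hypothetical numbers: 100 events, room for 10, consumer at half speed.
accepted, rejected, peak = simulate(range(100), capacity=10, drain_every=2)
print(accepted, rejected, peak)
```

The backlog never grows past `capacity`. The trade-off is that some events get rejected, and the producer has to decide whether to retry, drop, or slow down.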

Project management software

We’re invited to use Trello to manage our projects.

Screenshot from [@jofi](https://gist.github.com/jofi/6029000)

Server setup!

We spent the day creating our EC2 instances and installing hadoop.

Day #2 - setting up our AWS instances with Hadoop #insightfellows #aws #hadoop #bigdata

A photo posted by Katy Chuang, PhD (@katychuang.nyc) on Sep 9, 2015 at 11:50am PDT

The weekly schedule for the DE program looks roughly like this:

Week 1: Choose project, learn tech
Week 2: MVP
Week 3: Scaling Project
Week 4: Perfecting Project
Weeks 5-7: Presenting Project
Week 8: Interviews

Imagine, by the end of the 2nd week we would have most of our software installed and data formatted. Heigh-ho off to work we go!

~*~

In the afternoon we had a visit from Hilary Mason, who told us about her journey into data science and her work. A few of the things she mentioned:

Stats is a world of probabilities, whereas devops is a world of absolutes.
Communication skills are great to have; so is empathy.
It helps to be able to write deployable code.
Data science problems can vary.
It’s important to be cognizant of ethical concerns.

It’s great to hear someone speak about inclusivity and focusing on soft skills. Muy importante!

First guest talk, by Hilary Mason #insightfellows #datascience #womenintech

A photo posted by Katy Chuang, PhD (@katychuang.nyc) on Sep 9, 2015 at 12:12pm PDT

Right after that, we DE fellows had a session with Nathan Marz over Skype, together with the SV fellows. It was great to be introduced to them, though they were faceless voices.

He gave us some suggestions:

It’s better to define the project well.
Queries that count are easy, especially in batch. Unique counting in realtime is hard, especially because you’d have to index the data.
Quick and dirty can sometimes be ok: approximate the result first, then reduce the inaccuracy later with a batch process.
We should explore a number of techniques to demonstrate our skills to employers.
Image analysis and sentiment analysis can be too complex to tackle and are not considered core data engineering.
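
Here is my own rough sketch of the "approximate now, correct later in batch" idea for unique counts. It uses a K-minimum-values sketch, which is a technique I picked for illustration and not something Nathan named. The class name `KMVSketch`, the choice of `k`, and the fake `user_N` stream are all assumptions.

```python
import hashlib
import heapq


class KMVSketch:
    """Approximate distinct counter that keeps only the k smallest hash values."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hash values currently kept, to skip duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-random float in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with this smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct hashes seen
        # The k-th smallest of n uniform values sits near k/n.
        return int((self.k - 1) / -self._heap[0])


# Realtime path: a small, fixed-size sketch updated per event.
stream = [f"user_{i % 1000}" for i in range(10_000)]
sketch = KMVSketch(k=256)
for user in stream:
    sketch.add(user)
print("approximate uniques:", sketch.estimate())

# Batch path: the exact answer, computed later over the full data.
print("exact uniques:", len(set(stream)))
```

The sketch holds at most `k` values no matter how many users show up, which is what makes it workable in realtime. The batch job can then overwrite the estimate with the exact count.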

~*~

Still feeling overwhelmed, with my head spinning from a bunch of new lingo to remember. Is it possible to decide on a project in a matter of days and present it? Only time will tell~