Second day! We had some talks about picking DE projects. A project can be breadth-first, meaning creative coverage of many of the open source tools available, or depth-first, meaning focused on just one (maybe two).
Advice
- How much data is enough? Not a great way to think about it cuz problems vary. A good dataset could be more than a single computer can handle. If you need a distributed computing cluster, that’s a good sign of “big data”.
- Backpressure becomes an issue when data arrives faster than it can be processed: the backlog can take up too much memory and make your system crash.
- Batch processing is easier to build than realtime processing.
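To make the backpressure point concrete for myself, here is a minimal sketch in Python. It is not something shown in the talks; the function name `run_pipeline` and the sizes are made up for illustration. The idea is that a bounded queue forces a fast producer to wait for a slow consumer, so the backlog can never grow past a fixed size.

```python
import queue
import threading
import time


def run_pipeline(n_items, max_backlog):
    """Push n_items through a bounded queue.

    Returns (processed_items, peak_backlog_seen_by_producer).
    Hypothetical helper, for illustration only.
    """
    # Bounded queue: put() blocks once max_backlog items are waiting.
    backlog = queue.Queue(maxsize=max_backlog)
    processed = []

    def consumer():
        while True:
            item = backlog.get()
            if item is None:  # sentinel: the producer is done
                break
            time.sleep(0.001)  # simulate a slow downstream step
            processed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()

    peak = 0
    for i in range(n_items):
        backlog.put(i)  # blocks while the queue is full (the backpressure)
        peak = max(peak, backlog.qsize())
    backlog.put(None)
    worker.join()
    return processed, peak


if __name__ == "__main__":
    items, peak = run_pipeline(n_items=50, max_backlog=5)
    print(len(items), "items processed; peak backlog", peak)
```

With an unbounded queue (`maxsize=0`) the producer would never wait, and the backlog would be limited only by available memory, which is the crash scenario from the talk.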
Project management software
We’re invited to use Trello to manage our projects.
Server setup!
We spent the day creating our EC2 instances and installing Hadoop.
The weekly schedule for the DE program looks roughly like this:
- Week 1: Choose project, learn tech
- Week 2: MVP
- Week 3: Scaling Project
- Week 4: Perfecting Project
- Weeks 5-7: Presenting Project
- Week 8: Interviews
Imagine, by the end of the 2nd week we would have most of our software installed and data formatted. Heigh-ho off to work we go!
~*~
In the afternoon we had a visit from Hilary Mason, who told us about her journey into data science and her work. A few of the things she mentioned:
- Stats is a world of probabilities, whereas devops is a world of absolutes.
- Communication skills are great to have; so is empathy.
- It helps to be able to write deployable code
- Data science problems can vary
- It’s important to be cognizant of ethical concerns
It’s great to hear someone speak about inclusivity and focusing on soft skills. Muy importante!
Right after that, we DE fellows had a session with Nathan Marz, held over Skype together with the SV fellows. It was great to be introduced to them, though they were faceless voices.
He gave us some suggestions:
- It’s better to define the project well.
- Queries that count are easy, especially in batch. Unique counting in realtime is hard, especially because you’d have to index the data.
- Quick and dirty can sometimes be OK: approximate the result first, then reduce the inaccuracy later with a batch process.
- We should explore a number of techniques and tools to demonstrate our skills to employers.
- Image analysis and sentiment analysis can be too complex to tackle and are not considered core data engineering.
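To get a feel for the "approximate now, correct in batch later" suggestion, here is a rough sketch in Python. Nathan didn't name a specific technique, so the choice of linear counting (a small bitmap-based estimator) is my own assumption, and the names `LinearCounter` and `exact_unique_count` are made up for illustration.

```python
import hashlib
import math


class LinearCounter:
    """Approximate distinct counter using a fixed-size bitmap (linear counting)."""

    def __init__(self, num_bits=4096):
        self.num_bits = num_bits
        # One byte per slot for simplicity; a real bitmap would pack bits.
        self.bits = bytearray(num_bits)

    def add(self, item):
        # Hash the item to a slot; duplicates always land in the same slot.
        digest = hashlib.md5(str(item).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % self.num_bits
        self.bits[index] = 1

    def estimate(self):
        zeros = self.num_bits - sum(self.bits)
        if zeros == 0:
            # Bitmap is saturated; the estimate is no longer meaningful.
            return float(self.num_bits)
        # Linear counting estimate: -m * ln(fraction of empty slots).
        return -self.num_bits * math.log(zeros / self.num_bits)


def exact_unique_count(events):
    """The slow but exact batch answer: keep every distinct item."""
    return len(set(events))


if __name__ == "__main__":
    events = [f"user{i % 1000}" for i in range(5000)]
    counter = LinearCounter()
    for event in events:  # realtime path: fixed memory, one pass
        counter.add(event)
    print("approximate:", round(counter.estimate()))
    print("exact (batch):", exact_unique_count(events))
```

The realtime path never stores the items themselves, only a fixed-size bitmap, so its answer is an estimate. The batch path stores everything it has seen, which is what lets it replace the estimate with the exact count later.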
~*~
Still feeling overwhelmed, head spinning with a bunch of new lingo to remember. Is it possible to decide on a project in a matter of days and present it? Only time will tell~