Learning Data Engineering
I’m glad that Insight Data Science made their data engineering (DE) fellowship program available last fall, because I was blessed with the opportunity to meet wonderful people and spend time learning data engineering concepts.
Three years ago at PyData in Santa Clara, someone mentioned AWS to me as a potential server for a project I was working on at the time. The suggestion was meant to be helpful, but it wasn’t actionable enough for me to do anything with: what was AWS, how do I get to it, and what do I do once I have it? My knowledge of working with servers at the time was nonexistent, so the advice went over my head. This digital divide was one of the reasons I chose to start NYC PyLadies as a support group: I had very beginner questions and wanted to be surrounded by patient, kind people.
For years I used PythonAnywhere because I had to use Windows for work and did open source on the sly, and I didn’t know how to set up a server, let alone connect to it via SSH and make sure the connection was properly encrypted. Then I moved on to Heroku to serve web apps. I digress. Fast forward to now: after the Insight DE program, and numerous sweaty-brow moments, I not only know how to create a server instance on AWS and back it up, I also learned how to run a cluster with the ever-popular Hadoop and its distributed file system (HDFS), and how to set up pipelines for ingesting, processing, persisting, and serving data in multiple formats.
This post is possible because hosting a NYC PyLadies event was possible. Insight continues to provide support after the program is over. As an Insight alum and an organizer of NYC PyLadies, I find it wonderful to have a safe space not only to find but also to create learning opportunities, and to network with other professionals. Being a minority demographic has not been a barrier but rather an opportunity to try and experiment with new ways to enhance everyone’s experience. Having access to a place with open doors is incredibly encouraging as an organizer and also as an aspiring data engineer.
We had an evening with NYC PyLadies to talk about data careers, with some data engineers describing their work at Rent the Runway and being available to help answer questions from those looking to enter the field.
Data Themed Night with NYC PyLadies
We had lots of new faces tonight for Data Themed Night with NYC PyLadies!
It was wonderful to hear how Anna and Monica use Python tools such as Ansible to help move data along to the various teams at Rent the Runway that need cleaned data for their jobs.
Ansible is “an IT automation tool. It can configure systems, deploy software, and orchestrate more advanced IT tasks such as continuous deployments or zero downtime rolling updates.”
I asked how data visualizations are created: many of the charts needed for reporting are for internal use and are created with Tableau. A lot of data is needed to know where the inventory is and what condition the dresses are in. Some of the data science work revolves around providing recommendations to shoppers. It’s hard to describe style and to accurately predict fit, so they had to develop their own classification system to group similar dresses together.
Again, super cool to learn about the various tools for managing the various data formats needed to describe clothing and style, as well as user behaviors in the real world. It is so helpful to see what local companies use in production and how engineers tackle the various big data problems. It’s also incredibly generous of RTR to offer us attendees discount codes.
Below is information about the talk and the speakers.
ABSTRACT
Anna and Monica maintain the delicately-balanced data transfer architecture for Rent the Runway as the Data Engineering Team (DET). Keeping the data flowing at RTR means the operations team tracks inventory status by barcode, finance keeps tabs on the revenue health of the business, product monitors experiments through pixel data, users see product recommendations, and so much more.
DET aggregates many diverse internal and external data sources such as databases, log files, and APIs into the Great Data Amalgamation called Vertica. Over time DET has transformed the ingestion architecture into a robust Python framework that includes server creation, job description, and alerting. In the first part of the talk Anna will expound on the pros (and maybe some cons!) of this infrastructure and the future of this framework. In the second half, Monica will talk about how DET iterates rapidly using Python to solve meaningful problems and provide proofs of concept on permanent engineering features.
BIO
Anna Smith (twitter/linkedin) and Monica Quaintance (twitter/linkedin) are Data Engineers at Rent the Runway. They are responsible for the reporting infrastructure, data warehouse scalability, and data quality. Monica is from Atlanta, which means believing all foods can be deep-fried, and has a past in real estate investment banking. Anna is from Washington state, so can only really stand 65-degree temperatures, and went to grad school in Oregon for physics. Some fun pastimes of the two include: collecting obfuscated Python, creating holiday food dioramas, decision making through data, and shiny things. Oh yes, and dresses. They both love shiny dresses.
Many thanks to Insight Data Science for hosting us.