
natekupp · on Aug 11, 2020

hey bserial, I'm part of the team working on Dagster.

While there are many things we're working on, there are 3 goals that got me excited about working on this system:

1. Local development: most modern workflow orchestration systems don't have a good local development story. We want to provide a seamless end-to-end dev experience from your laptop to CI to dev to prod for authoring data workflows.

2. Complexity: the Airflow deployments I've worked on or otherwise encountered have hundreds of DAGs and thousands of tasks scheduled on an hourly or daily cadence. We aim to provide abstractions to better support managing and wrangling that complexity.

3. Testability: Most modern data platforms are poorly tested. Many orchestration systems, like Airflow, tend to hardcode deployment concerns into the business logic, e.g. EmrAddStepsOperator. With Dagster, we aim to separate the business logic from environmental concerns to make it easy to swap out an external resource implementation for a mock, dev version, etc.
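
To make the third point concrete, here is a minimal sketch of the general idea in plain Python. It is not Dagster's API: `ClusterResource`, `MockCluster`, and `nightly_aggregation` are hypothetical names made up for illustration. The business logic only depends on an interface, so a test can hand it a mock instead of a real cluster.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


class ClusterResource(Protocol):
    """Hypothetical interface for 'somewhere to run a step'."""

    def run_step(self, step_name: str) -> str: ...


@dataclass
class MockCluster:
    """Test double: records submitted steps instead of calling a real cluster."""

    submitted: List[str] = field(default_factory=list)

    def run_step(self, step_name: str) -> str:
        self.submitted.append(step_name)
        return f"mock:{step_name}"


def nightly_aggregation(cluster: ClusterResource) -> List[str]:
    """Business logic: knows which steps to run, not where they run."""
    return [cluster.run_step(name) for name in ("extract", "aggregate", "publish")]
```

In a test you would call `nightly_aggregation(MockCluster())`; in prod you would pass an object with the same `run_step` method that talks to the real environment.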

Hope that makes sense!

natekupp · on Jan 1, 2020

We use Pulumi to manage both our GCP and AWS resources, and we really like it.

You might consider using Terraform directly if you want something more mature.

natekupp · on Aug 1, 2017

Thumbtack | Software Engineer, SRE, many others | San Francisco, CA | ONSITE

Thumbtack is a local services marketplace that connects millions of customers with the right professionals for anything they need done.

We are a friendly, ambitious team of 100+ engineers in a bright SoMa office with daily home-cooked food, backed by Sequoia and Google Capital. Together, we are disrupting a $700B market in the US alone where word of mouth is still the status quo.

We're looking for engineers and SREs interested in working with Go, Scala/Spark, PHP, Angular, iOS, Android, and AWS/GCP. We're also looking for data scientists interested in predictive modeling, machine learning, and experimental design and analysis. Join us!

http://www.thumbtack.com/jobs http://www.thumbtack.com/engineering Please reach out to jessica [at] thumbtack.com with any questions.

natekupp · on July 8, 2017

We use DynamoDB quite a bit at Thumbtack. Our biggest issue is backups - just wrote a short note about our experiences with DynamoDB here: https://medium.com/@natekupp/dynamodb-and-backups-16dba0dbcd...

Dunedan · on July 8, 2017

Oh yes. Backups for DynamoDB are a pain in the ass, especially as AWS doesn't offer an out-of-the-box solution for that.

That Data Pipeline + EMR solution mentioned in the blog post (here is a better link for it: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGui...) has several drawbacks:

- too many moving parts, especially given the track record of EMR

- might not even be available when your requirement is to keep the data in the same AWS region as the DynamoDB table, as only five regions support Data Pipeline

The best approach I've seen so far is to use DynamoDB Streams and an AWS Lambda function to create incremental backups in a versioned S3 bucket. dynamodb-replicator (https://github.com/mapbox/dynamodb-replicator) implements that, together with some scripts for management tasks like backfilling an S3 bucket with data that is already in DynamoDB or joining incremental backups into a single file.

It's still pretty unpolished and definitely needs some love, but I think it's the right approach.
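
For anyone curious what the Streams + Lambda approach looks like, here is a rough sketch in Python. This is not how dynamodb-replicator is implemented; `backup_key` and `handle_stream_event` are hypothetical names, and the key scheme (a hash of the item's primary key) is just one possible choice. It assumes the stream is configured to include new images, and that `s3` is something with `put_object`/`delete_object`, such as `boto3.client("s3")`.

```python
import hashlib
import json


def backup_key(table: str, keys: dict) -> str:
    """Derive a stable S3 object key from the item's primary key (assumed scheme)."""
    digest = hashlib.md5(json.dumps(keys, sort_keys=True).encode()).hexdigest()
    return f"{table}/{digest}"


def handle_stream_event(event: dict, s3, bucket: str, table: str) -> int:
    """Mirror each stream record into the bucket; returns the number processed.

    With bucket versioning enabled, each overwrite or delete keeps the
    previous object versions, which is what makes the backup incremental.
    """
    count = 0
    for record in event.get("Records", []):
        change = record["dynamodb"]
        key = backup_key(table, change["Keys"])
        if record["eventName"] == "REMOVE":
            s3.delete_object(Bucket=bucket, Key=key)
        else:  # INSERT or MODIFY
            body = json.dumps(change["NewImage"], sort_keys=True)
            s3.put_object(Bucket=bucket, Key=key, Body=body.encode())
        count += 1
    return count
```

Passing the client in as an argument keeps the handler testable with a fake object, without touching AWS.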

natekupp · on June 11, 2017

The original Science paper is here: http://science.sciencemag.org/content/356/6342/1046.full

The section "Relativistic deflections by foreground stars" walks through the math which enables this, which I found really interesting.

natekupp · on June 11, 2017

Have you considered just applying through the jobs pages of the larger tech companies? Apple, Google, Facebook, LinkedIn are all on the peninsula and more accessible than startups in SF.

natekupp · on June 6, 2017

We're also doing this at Thumbtack. We run all of our Spark jobs in job-scoped Cloud Dataproc clusters. We wrote a custom Airflow operator which launches a cluster, schedules a job on that cluster, and shuts down the cluster upon job completion. Since Google can bring up Spark clusters in < 90s and bills minutely, this works really well for us, simplifying our infrastructure and eliminating resource contention issues.
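
The launch / run / tear-down pattern can be sketched roughly like this. This is not the actual operator and not the Airflow or Dataproc API: `client` is a hypothetical object with `create_cluster`, `submit_job`, and `delete_cluster` methods, and the function names are made up. The point is only that the cluster is deleted whether the job succeeds or fails.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_cluster(client, name: str):
    """Create a cluster for the duration of the block, then always delete it."""
    client.create_cluster(name)
    try:
        yield name
    finally:
        # Runs on success and on failure, so no cluster is left billing.
        client.delete_cluster(name)


def run_job_on_fresh_cluster(client, cluster_name: str, job: str):
    """Run one job on its own short-lived cluster and return the job result."""
    with ephemeral_cluster(client, cluster_name) as cluster:
        return client.submit_job(cluster, job)
```

An operator built this way would call something like `run_job_on_fresh_cluster` from its execute step, with a real cloud client in place of the hypothetical one.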

vgt · on June 6, 2017

Co-Author of the blog here.

Awesome stuff, glad to see folks leveraging the possibilities! Perhaps as a follow-up you could write a guest blog on how this works for you! Feel free to ping me offline.

oh_sigh · on June 7, 2017

Have you tried calculating what percentage increase in cost there would be if you moved to an AWS billing style?

Basically I'm curious if your hands are tied to GCP because of the fine-grained billing they provide?
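
A back-of-the-envelope way to estimate that, assuming the only difference is the billing increment (per-minute vs. per-hour rounding) and the hourly rate is the same. The function names, the rate, and the job durations below are all made up for illustration, not real numbers from anyone's bill.

```python
import math


def billed_cost(duration_min: float, rate_per_hour: float, increment_min: int) -> float:
    """Cost when usage is rounded up to the next billing increment."""
    billed_min = math.ceil(duration_min / increment_min) * increment_min
    return billed_min / 60 * rate_per_hour


def pct_increase(durations_min, rate_per_hour: float,
                 fine_min: int = 1, coarse_min: int = 60) -> float:
    """Percentage cost increase going from fine-grained to coarse billing."""
    fine = sum(billed_cost(d, rate_per_hour, fine_min) for d in durations_min)
    coarse = sum(billed_cost(d, rate_per_hour, coarse_min) for d in durations_min)
    return (coarse - fine) / fine * 100
```

For example, three hypothetical jobs of 15, 45, and 90 minutes bill as 150 minutes under per-minute rounding but 240 minutes under per-hour rounding, so `pct_increase([15, 45, 90], 6.0)` comes out to roughly a 60% increase. The shorter the jobs, the bigger the gap.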

natekupp · on June 1, 2017

Thumbtack | Software Engineer, SRE, Data Scientist, many others | San Francisco, CA | ONSITE

Thumbtack is a local services marketplace that connects millions of customers with the right professionals for anything they need done.

We are a friendly, ambitious team of 100+ engineers in a bright SoMa office with daily home-cooked food, backed by Sequoia and Google Capital. Together, we are disrupting a $700B market in the US alone where word of mouth is still the status quo.

We're looking for engineers and SREs interested in working with Go, Scala/Spark, PHP, Angular, iOS, Android, and AWS/GCP. We're also looking for data scientists interested in predictive modeling, machine learning, and experimental design and analysis. Join us!

http://www.thumbtack.com/jobs http://www.thumbtack.com/engineering Please reach out to jessica [at] thumbtack.com with any questions.

natekupp · on May 18, 2017

I'm constantly surprised by how much work Amazon expects its customers to do themselves. The work that Segment has done here should be a service provided by AWS directly, continuously updating cost data in a Redshift database without any customer work required.

We just migrated our data infrastructure to GCP. One of the big motivators was experiences like this with AWS. We've got near-realtime GCP cost dashboards in BigQuery, and the only meaningful work on our end to make that happen was writing the SQL queries.

jdc0589 · on May 18, 2017

Agreed. I don't know why Amazon (and Azure) make this so hard. I've done something pretty similar to what Segment did (except it supports normalizing stats from Azure and AWS), and 90% of the work is stuff you don't feel like you should have to be doing.

imron · on May 18, 2017

> I don't know why Amazon (and Azure) make this so hard.

The company posting this recently managed to save $1,000,000 annually on their AWS bill.

Having confusing billing makes it harder to spot that you're paying too much.

j_s · on May 18, 2017

Yes I think the motivation is clear.

It is the same reason Dropbox doesn't include anything in the web interface to find large files.

--

Edit: tossing in some options to accomplish this; not really related to the current discussion:

Space usage analyser for Dropbox? | https://webapps.stackexchange.com/questions/47440/space-usag...

Unclouded - Cloud Manager | https://play.google.com/store/apps/details?id=com.cgollner.u...

deno · on May 18, 2017

https://en.wikipedia.org/wiki/Confusopoly

L_Rahman · on May 18, 2017

Product development incentive structures.

Imagine you lead the product team at AWS.

- The team is reviewing what to build for the next quarter.

- You have a long list of revenue-generating features

- At the end of the list there's one more item that'll help your customers spend LESS money on your product.

- You can only build so many things and you know that you, your department, and Amazon the company will get pats on the back in $$$ form if you focus on the revenue generating features

- Sure, it'd be really good to help longterm customers understand their costs better, but your biggest ones have the resources to build that infrastructure themselves anyways.

And that is why this is so hard.

claudenm · on May 19, 2017

I think you're absolutely right, but what's hilarious to me is that they have private pricing for many of their services. For the traffic we're already doing, we just asked and they knocked our CloudFront bill down nearly 75%. We didn't have to change our usage at all. Granted, we serve a lot of traffic.

wahnfrieden · on May 19, 2017

What order of magnitude of traffic are we talking?

doubleplusgood · on May 19, 2017

A couple of PB per week, I'd guess

tonyedgecombe · on May 18, 2017

I'm pretty sure it's in Amazon's interest for its costs to be opaque.

Cidan · on May 18, 2017

Second this. We just moved our sizable infrastructure to GCP for more or less the exact same reason. GCP makes all of this a breeze -- I don't understand why other providers can't do the same.

brianwawok · on May 18, 2017

How much more would current customers pay for better billing dashboards?

How many customers quit Amazon due to hard-to-read billing?

I suspect those are the two big questions to explain why this feature is lower on the priority radar...

matt4077 · on May 18, 2017

This way of looking at things has never been particularly convincing to me... Of all the work that's done, at any organisation, almost none of it can truly answer those questions affirmatively.

"That lightbulb is broken!"

"I was going to replace it, but I couldn't find a customer to pay for it, and nobody has threatened to change to a brighter competitor"

brianwawok · on May 19, 2017

Have you ever founded a startup? Did you spend your time working on features users will pay for, or an awesome billing dashboard (assuming billing is not your core product, of course)? If so, how successful were you?

It's about focus. Only so many hours in a day, so many dollars, so many developers. If AWS got a nicer billing dashboard but Lambda was delayed by 6 months, how would it have changed the place AWS has in the market?

Yes you could take this too far and never change a light bulb. But I would answer "I will be more productive if I have a nice light bulb, so it is worth the 5 minutes to change it"

sjg007 · on May 18, 2017

I'm sure Amazon is working on it.

jdubs · on May 18, 2017

They're letting their partners work on it.

natekupp · on May 1, 2017

Thumbtack | Software Engineer, SRE, Data Scientist, many others | San Francisco, CA | ONSITE

Thumbtack is a local services marketplace that connects millions of customers with the right professionals for anything they need done.

We are a friendly, ambitious team of 100+ engineers in a bright SoMa office with daily home-cooked food, backed by Sequoia and Google Capital. Together, we are disrupting a $700B market in the US alone where word of mouth is still the status quo.

We're looking for engineers and SREs interested in working with Go, Scala/Spark, PHP, Angular, iOS, Android, and AWS/GCP. We're also looking for data scientists interested in predictive modeling, machine learning, and experimental design and analysis. Join us!

http://www.thumbtack.com/jobs http://www.thumbtack.com/engineering Please reach out to jessica [at] thumbtack.com with any questions.