Hacker News | james_woods's comments

Citing an article from MIT Sloan Management Review:

"But there’s no clear evidence that these mandates improve financial performance. A recent study of S&P 500 companies that was conducted by University of Pittsburgh researchers found that executives are “using RTO mandates to reassert control over employees and blame employees as a scapegoat for bad firm performance.” Those policies result in “significant declines in employees’ job satisfaction but no significant changes in financial performance or firm values,” they concluded." (https://sloanreview.mit.edu/article/return-to-office-mandate...)


>When we look back over the last five years, we continue to believe that the advantages of being together in the office are significant.

>we’ve observed that it’s easier for our teammates to learn, model, practice, and strengthen our culture;

They believe there are advantages to being fully in-office because they have observed these things. There is zero mention of actual data to back up their beliefs and observations, despite substantial industry data showing that remote and hybrid work environments are more beneficial on almost every level.


They were called out on this for not living up to their "data driven culture" when they went back to three days in the office, and the response was basically "Uh well... Eat shit I guess."


At the very least I would have expected some kind of internal survey data. But they’ve offered 0. Just vibes.


“Data-driven culture” until a board full of Boomers just kind of feels differently



HELLO HAMBINI FANS AND WELCOME


(from the post) Tableau is checking to see if the host has changed. This happens all the time in the cloud; for example, GCP has a live-migration feature that keeps you going even if the underlying hardware fails. Also, a restarted VM is not guaranteed to boot up on the same host. The problem: support cannot reset the license count until you have reached the limit (max of 3 activations), i.e. until you have played the lottery and hit the jackpot by locking up your system: https://kb.tableau.com/articles/issue/error-tableau-server-i...


If you stop buying animal products, that lowers the demand for them, so it has long-term effects; donating $100 is a short-term effect.


The source [0] of the article makes it sound like no actual, physical animals are being saved; instead, long-term efforts are being undertaken that the evaluators somehow converted into physical animals.

[0]: https://animalcharityevaluators.org/charity-review/the-human...


SQL was not made for programmers alone. It was also invented for not-so-technical people, so the verbosity and overhead are part of the deal.

>When Ray and I were designing Sequel in 1974, we thought that the predominant use of the language would be for ad-hoc queries by planners and other professionals whose domain of expertise was not primarily database management. We wanted the language to be simple enough that ordinary people could ‘‘walk up and use it’’ with a minimum of training.

https://ieeexplore.ieee.org/document/6359709


> We wanted the language to be simple enough that ordinary people could ‘‘walk up and use it’’ with a minimum of training.

In that case it has been an abject failure. I have been using SQL since the mid-1980s (so pretty much since the start of its widespread adoption), and I have never met "ordinary people" (by which I assume they meant intelligent, business-oriented professionals) who could (or wanted to) cope with it.

I like it, but the idea of sets does not come naturally to most people (me included). But I once worked with a programmer who had never been exposed to SQL - I lent him an introductory book on it and he came in the next day and said "Oh, of course, it's all sets, isn't it?" and then went on to write some of the most fiendish queries I have ever seen.


I don't think it's disputed that the original goal of SQL was a flop. The designers grossly underestimated the technical chops of a layman. However, I would argue that we tech people did benefit from that original goal of simplicity. I mean, SELECT first, last FROM employees WHERE id = 10 is not too bad at all. Kind of elegant, no?

If SQL was designed "by engineers for engineers", you would be using esoteric Git commands just to blow off steam.


> Kind of elegant, no?

Oh, I agree, which is why I said "I like it", and compared to things like CODASYL it was and is shining sanity.


This is very important. I personally like old technologies and languages where the designers considered users who had limited technical skills, and most importantly, assumed that those users had no interest or need to improve their technical skills. Removing the assumption that users are willing to increase their technical sophistication forces a designer to think more about what they're designing. Looking at older languages is interesting - for all their warts, they do feel more intentional in their design than modern things that have a clear developer-centric mindset baked in.


It's similar to the discussions around COBOL back in the day. What's interesting is that the primary "end user" language is Excel formulas. Who would have thought a declarative relational language with matrices would "win" that battle?

Any argument that "users will write their own" languages is basically flawed. Users want results; if there's no alternative, they'll do it themselves, in the simplest but probably most inefficient way possible.


This is a recurring theme. I was once amazed to learn that ClearCase was made for lawyers to use. The name suddenly makes a lot of sense!



Nice! Btw: Is there a list of projects that got done because of the pandemic? :)


Where and how does dataflow handle late data? How can I configure how refinements relate? These are the standard "What, Where, When, How" questions I want to answer and put into code when dealing with streaming data. I was not able to find this in the documentation, but I only spent a few minutes scanning it.

https://www.oreilly.com/radar/the-world-beyond-batch-streami...

https://www.oreilly.com/radar/the-world-beyond-batch-streami...
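To make the "What Where When How" concrete, here is roughly how those questions can be expressed in Apache Beam's Python SDK. This is only a minimal sketch with made-up element values, window size, and lateness (FixedWindows are Beam's tumbling windows), not a recommendation:

    # Illustrative only: the window, trigger, allowed lateness and accumulation mode
    # answer the "Where / When / How" questions; the aggregation itself is the "What".
    import apache_beam as beam
    from apache_beam import window
    from apache_beam.transforms.trigger import AfterWatermark, AfterProcessingTime, AccumulationMode

    # Made-up elements: (key, value, event time in seconds); the last one arrives out of order.
    events = [("clicks", 1, 10.0), ("clicks", 1, 70.0), ("clicks", 1, 65.0)]

    with beam.Pipeline() as p:
        _ = (
            p
            | "Create" >> beam.Create(events)
            | "Stamp" >> beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2]))
            | "Window" >> beam.WindowInto(
                window.FixedWindows(60),                              # Where: 1-minute tumbling windows
                trigger=AfterWatermark(late=AfterProcessingTime(0)),  # When: at the watermark, refire for late data
                allowed_lateness=30 * 60,                             # how long to keep waiting for late data
                accumulation_mode=AccumulationMode.ACCUMULATING,      # How: refinements accumulate
            )
            | "Count" >> beam.CombinePerKey(sum)                      # What: a per-key sum
            | "Print" >> beam.Map(print)
        )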

Also, "Materialize" does not seem to support needed features like tumbling windows (yet) when dealing with streaming data in SQL: https://arxiv.org/abs/1905.12133

Additionally, "Materialize" states in their docs: "State is all in totally volatile memory; if materialized dies, so too does all of the data." This is not the case for Apache Flink, for example, which can keep its state in systems like RocksDB and checkpoint it to durable storage.
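As a rough illustration (the exact class names and import paths depend on the Flink/PyFlink version, so this is a sketch rather than copy-paste code), enabling a RocksDB-backed, checkpointed state backend in PyFlink looks roughly like this:

    # Sketch: keep operator state in RocksDB and checkpoint it, so a restarted
    # job can recover its state instead of losing everything. The URI is made up.
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.datastream.state_backend import RocksDBStateBackend

    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(60_000)  # checkpoint every 60 seconds
    env.set_state_backend(RocksDBStateBackend("file:///tmp/flink-checkpoints"))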

Having SideInputs or seeds is pretty neat; imagine you have two tables of several TiB or larger. This is also something that "Materialize" currently lacks: "Streaming sources must receive all of their data from the stream itself; there is no way to “seed” a streaming source with static data."


Late data is very deliberately not handled. The reasoning for that is best available at [0]. Now, there are ways [1] to handle bitemporal data, but they have fairly significant issues in ergonomics and performance, due to the additional work needed to allow the bitemporal aggregations.

As for the data persistence, that's something the underlying approach for the aggregations could handle relatively well with LSM trees [2] (back then, `Aggregation` was called `ValueHistory`).

Along with syncing that state to replicated storage, it should not be a big problem to make it recover quickly from a dead node.

[0]: https://github.com/frankmcsherry/blog/blob/master/posts/2020...

[1]: https://github.com/frankmcsherry/blog/blob/master/posts/2018...

[2]: https://github.com/TimelyDataflow/differential-dataflow/issu...


Taken from [0]: "If you wanted to use the information above to make decisions, it could often be wrong. Let's say you want to wait for it to be correct; how long do you wait?"

I know how long I want to wait: 30 minutes in one of my cases, as I know that I've seen 95% of the important data by then. In the streaming world there is _always_ late data, so being able to tell what should happen when the rest (the other 5%) arrives is crucial for me.

This differs from use case to use case, and being able to configure it and to handle out-of-order data at scale is key for me when selecting a framework for stream processing. Apache Beam and Apache Flink do this very well.

Taken from [1]: "Apache Beam has some other approach where you use both and there is some magical timeout and it only works for windows or something and blah blah blah... If any of you all know the details, drop me a note."

It obviously only works when you window your data, as it needs to fit in memory. The event-time and system-time concepts in Beam and Flink are very similar, as is the watermark approach. Thank you for sharing the links; it is now clearer to me where the difference lies between differential-dataflow and stream-processing frameworks (which also offer SQL and even ACID conformity!). I'm using Beam/Flink in production, and missing out on one of the points mentioned above is a deal-breaker for me.


What do you usually want to happen with late data? In DD you have the option to ignore it at the source but not to update already-emitted results. Is the latter important for you?


In DDflow, you could also use the `Product` timestamp combinator and track both the time the event came from and the time you ingested it. You can then make use of the data as soon as the frontier says it's current for the relevant ingestion timestamp, and occasionally advance the frontier for the origin timestamp at the input, so that arrangements can compact historic data. An example of an affected query would be one that counts "distinct within some time window". It only has to keep that window's `distinct on` values around as long as you can still feed it events with timestamps in that window. If you can no longer do so, the values of the `distinct on` become irrelevant for this operator, and only the count for that window needs to be retained.


If I have to report transactions (aka money), then yes, I need to update already-emitted results. If it's just a log-based metric for internal use, then no.

What I would like to have is a choice - and Apache Beam for example lets you choose this.


I have been using Apache Airflow for a couple of years now, and the biggest improvement was the addition of the Kubernetes operator. You basically keep a vanilla Airflow installation, and the custom code is encapsulated and tested in containers. This simplifies things a lot.
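To make that concrete, a containerized task in a DAG can look roughly like the sketch below. The image name and IDs are made up, and the exact import path of the operator depends on the Airflow version, so take it as an illustration only:

    # Hypothetical DAG: Airflow only schedules; the actual work runs in a container on Kubernetes.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

    with DAG(
        dag_id="containerized_etl",          # made-up name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        transform = KubernetesPodOperator(
            task_id="transform",
            name="transform",
            namespace="airflow",
            image="gcr.io/my-project/etl:1.2.3",   # custom code lives (and is tested) in this image
            arguments=["--date", "{{ ds }}"],
            get_logs=True,
        )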


This is what we do - it's great for decoupling the two. Any heavy work is run in GKE with some custom operators. It also makes onboarding non-engineers much easier, as they don't have to worry about connections/credentials/etc.


"and for settlers trying to persuade Indians to abandon tribal lands that these newcomers wished to inhabit."

Ah - they persuaded them. What a lovely euphemism.


You have taken it out of context; it was written ironically about the most common uses of revolvers, so it was clearly a joke (the whole article was written in a humorous manner, for those who didn't read it).


That's... the joke.

