Pypeline: A Python library for creating concurrent data pipelines (github.com/cgarciae)
121 points by cgarciae on Sept 24, 2018 | hide | past | favorite | 46 comments


Too much abbreviation!

pypeline --> pypeln

multiprocessing pipeline --> pr

threads pipeline --> th

asyncio pipeline --> io

this is totally unnecessary

If I want to use short abbreviated names in my code I can always `from pypeline import multiprocess_pipeline as pr`

Your library shouldn't export them like this as the default.

`io` is especially bad, since it shadows the `io` module in the Python stdlib.


Not a fan of this. You save a couple of characters and massively decrease legibility; anyone new to the project will find it far harder to understand. Instead of abbreviation hell, just spell it out and move on.


True masters of naming make sure to use variable names that are long but unambiguous after the first several characters, so that with tab-completion you get the best of both worlds.


Point taken! Thanks a lot for your feedback. Just a few points:

* `pypeline` is already taken :(

* My main reason for this was that initially I thought you would do `import pypeln as pl` and then call things like `pl.pr.map`. Since you can't alias the modules inside `pl`, I picked short names, but then I decided to go for an import-the-module kind of strategy.

I am thinking about expanding the module names to their worker names:

* pr --> process

* th --> thread

* io --> task

And then have the conventions:

* from pypeln import process as pr

* from pypeln import thread as th

* from pypeln import task as io # as ta?

This conversation is very valuable, thank you all for the feedback.


Hi, just coming back to say congrats on your library and I wish you all the best with it :)

I see your reasoning here (`import pypeln as pl`) but I still think where you have submodules you should use unabbreviated words for their names.

For me I'd be happy with `pl.process.map` in my code, but `pl.pr.map` feels a bit too obscure to have as the default.

These things are quite subjective of course, but part of that subjective judgement comes from the experience of what is commonly done in other Python libraries (the stdlib is a bit of a mixed bag in this regard unfortunately, riddled with CamelCase and other abominations).


First of all, it's super cool. :)

There are a lot of "pipe" projects on PyPI, but your project is also about process management. Maybe you should avoid "pipe" in the name? FlowProcessor? nFlow? xFlow?

I do agree that you should avoid io for asyncio. You should probably at least use aio, but there's no reason you can't have asyncio_task, thread_task, multiprocessing_task.

Lastly, in my mind the killer app for this would be to allow something that works on top of Celery in production, but then be able to fall back to say multiprocessing or threading when running locally. That would allow me to prototype something, and then when I want to scale, I can just change a config setting.


Hey, thanks for all the feedback. I will change the naming, since it's something most of you have agreed is a good change.

The goal I have for Pypeline is much simpler: let you easily set up data pipelines that leverage processes, threads, and asyncio for what each is good at. So in my mind a killer app would be a pipeline that starts with an asyncio stage for e.g. downloading images, then a multiprocess stage for e.g. doing image processing, and finally a threading stage for e.g. interacting with the OS.
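A rough stdlib-only sketch of that kind of staged pipeline (this is not pypeln's actual API; thread pools stand in for all stage types here, and `download`/`process` are made-up placeholders for an I/O-bound and a CPU-bound step):

```python
from concurrent.futures import ThreadPoolExecutor

def download(url):   # placeholder for an I/O-bound stage (asyncio in pypeln)
    return f"bytes-of-{url}"

def process(blob):   # placeholder for a CPU-bound stage (processes in pypeln)
    return blob.upper()

urls = ["a.png", "b.png"]

# Each stage gets its own worker pool; pool.map preserves input order,
# and the second stage consumes the first stage's results lazily.
with ThreadPoolExecutor(max_workers=4) as io_pool, \
     ThreadPoolExecutor(max_workers=2) as cpu_pool:
    blobs = io_pool.map(download, urls)
    results = list(cpu_pool.map(process, blobs))
```

For a real CPU-bound stage you would swap the second pool for a `ProcessPoolExecutor`, which requires the worker function to be picklable and defined at module top level.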

Right now I see Pypeline more as an easy-to-use single-machine tool rather than a higher-level distributed abstraction like Celery. Maybe other frameworks could leverage Pypeline to ease their work.


So then I would want to have one more stage... a celery stage for when you want to cluster work across multiple machines. :)


> My main reason for this was because initially I was thinking that you did an `import pypeln as pl` and then called things like e.g. `pl.pr.map` since you cant abbreviate the module inside `pl` then I picked short names

"Assumptions are the root of all evil."

With autocomplete, a coder has no reason to use short names anyway.


"import pypeln as pl" could cause quite a bit of confusion in Poland I would imagine.


And maybe perl...


hahahaha


At the risk of speaking from ignorance (I don't really know Python), isn't this succinctness an aim of Python? A lot of Python enthusiasts I speak with decry the verbosity of Java and point to how much less code it takes to do the same thing in Python.


Python has a famous adage: "explicit is better than implicit".

You want to be explicit but still concise while writing python.

I do not have any experience with Java, but I guess when writing Java you can feel that you use more words than needed (i.e. the definition of verbose [0]). Wikipedia has a hello-world example [1] and it feels just heavy.

IMHO if you write Pythonic code, very often it feels like writing/reading prose, like talking to the computer. You are not too implicit, nor too verbose.

[0] https://www.merriam-webster.com/dictionary/verbose

[1] https://en.wikipedia.org/wiki/Java_(programming_language)#"H...


I'm guilty of this, because words are so meaningless to me. I'd rather have a single letter or glyph than seemingly clear yet ambiguous words.


It's really more about readability. In this case threads_pipeline is much easier to understand than th. Descriptive names also mean needing fewer comments.


The succinctness of Python (such as there is) comes from the syntax itself. On the other hand, obscure variable names are discouraged. It's not a 'code golf' language. The aim is clarity and readability - unobscured by either verbose boilerplate or excessive brevity.

A library author should provide names which are descriptive and clear. Users can then abbreviate them as much or as little as they choose (by import aliasing). For example, it is very common to see `import numpy as np`... but you wouldn't want the library published as `np`. It should have its proper name.

One reason for this is if I'm exploring code in a REPL or IDE with tab-completion. You want to have some idea what a module is for, without having to play 'guess the abbreviation'.


And it's especially unnecessary since Python already supports abbreviation during import. It could just as easily be "from pypeline import process as pr" or something similar.


Agreed. When it comes to naming things, I really like the way Russ Cox puts it: https://research.swtch.com/names


Absolutely agree with this.


Snakemake [0] is a tool worth checking out. You can use it to create declarative workflows, and similar to make, it creates a DAG of dependencies when you give it your desired output. Each rule can specify how many threads it needs and other arbitrary resources and the scheduler uses that to constrain execution. Workflows are architecture independent - you should be able to execute a snakemake workflow on a laptop, in the cloud, or on an HPC cluster.

It also allows you to use UNIX pipes with your dependent jobs when that is appropriate [1].

[0] https://snakemake.readthedocs.io/en/stable/index.html

[1] https://snakemake.readthedocs.io/en/stable/snakefiles/rules....


I wonder if you might compare this to Bonobo [https://www.bonobo-project.org/] which I think has similar design goals?


Pypeline is a library you use in your code, while Bonobo seems to be a framework that uses your code. I tend to think that you lose flexibility with the latter.


    Pypeline was designed to solve simple medium 
    data tasks that require concurrency 
    and parallelism but where using frameworks 
    like Spark or Dask feel exaggerated or unnatural.
This is exactly what I was looking for very recently. Thank you for writing this, I'll certainly look into it.


What about Apache Beam? Getting started with the Python SDK has been very easy IMHO. Also, you are future proof as you can easily switch runner from Local to Dataflow/Flink/...


I use Beam for my Dataflow jobs, but their local "DirectRunner" is just for testing purposes. And as with Spark, Beam is a huge beast. Pypeline was created with simplicity in mind: it's a pure Python library with no dependencies.


But not future proof in having to use python 2.7 unfortunately.


Seems like a good time to link to this curated list of pipeline toolkits (not all python).

https://github.com/pditommaso/awesome-pipeline/blob/master/R...


This too:

https://github.com/common-workflow-language/common-workflow-...

Also, whenever these conversations about flow-based / pipelining tools come up, I always like to point people to Common Workflow Language, to remind people that there is an attempt at standardizing workflow descriptions so that they can be used with different packages:

https://www.commonwl.org/


Speaking from my experience building similar pipelining and reverse-Polish function-application tooling in Python:

Piping using the | operator can make tracebacks pretty ugly with some operators.

If you want to keep the code somewhat 'pythonic' without introducing the syntax magic of |, then instead of writing:

  range(10)
  | pp.flatmap(lambda x: [x + 1, x + 2])
  | pp.map(lambda x: x * x)
  ...
You can do this instead:

  xs = range(10)
  xs = pp.flatmap(xs, lambda x: [x + 1, x + 2])
  xs = pp.map(xs, lambda x: x * x)
  ...
It helps to keep the operand as the first argument, instead of the last, because those lambdas are best kept at the end.

So instead of

  map(fn, xs)
do

  map(xs, fn)
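The operand-first convention the parent describes can be sketched with plain list comprehensions (`pmap`/`pflatmap` are hypothetical names, not pypeln's API):

```python
def pmap(xs, fn):        # operand first, callback last
    return [fn(x) for x in xs]

def pflatmap(xs, fn):    # each element maps to a list, then flatten
    return [y for x in xs for y in fn(x)]

xs = range(10)
xs = pflatmap(xs, lambda x: [x + 1, x + 2])
xs = pmap(xs, lambda x: x * x)
# xs is now [1, 4, 4, 9, 9, 16, ...]
```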


None of these frameworks (there are many) seem to have support for repeating a certain target multiple times, with different arguments. For example, say you have a data set with per-country data; how do you repeat the same analysis on each country? This simple example is easy with a loop, but when you have multiple dimensions like this, you want to call each target with all possible permutations, depending on which type of dimension is actually relevant for that target. Does any ETL framework support that?

(I was actually writing a spec this afternoon for a new tool that does exactly this, because I can't find anything suitable.)
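For what it's worth, the permutation part alone is easy to express with `itertools.product` (the dimension lists and the `analyze` target here are made up):

```python
from itertools import product

countries = ["usa", "canada", "mexico"]
years = [2016, 2017]

def analyze(country, year):  # stand-in for a per-slice analysis target
    return f"analysis.{country}.{year}"

# Run the same target for every combination of the relevant dimensions.
results = [analyze(c, y) for c, y in product(countries, years)]
```

The hard part the parent describes is orchestration: calling each target only with the dimensions relevant to it, which is where a workflow tool's wildcard/parameter machinery comes in.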


snakemake does this trivially:

    rule analyze_country:
        input: 'whatever.{country}.txt'
        output: 'analysis.{country}.txt'
        shell:
            'run-analysis-on-country {input} {output} --country=country'

    rule analyze_target_countries:
        input: ['analysis.usa.txt', 'analysis.canada.txt', 'analysis.mexico.txt']


Small change: you have to use wildcards.country inside the shell call:

    rule analyze_country:
        input: 'whatever.{country}.txt'
        output: 'analysis.{country}.txt'
        shell:
            'run-analysis-on-country {input} {output} --country={wildcards.country}'


Sounds like a group_by and then a function per group? Pypeline doesn't have grouping, but it sounds like Spark or Dask should do the job.


Dask is relatively lightweight actually, because it is pure Python.

Also, there is "Streamz" which solves a similar problem, seems more mature and can work with or without Dask or Dask-Distributed.


Dask might be lightweight internally but resorting to it just to solve a simple task that requires concurrency is not "simple".

Streamz looks nice! However:

"Streamz relies on the Tornado framework for concurrency. This allows us to handle many concurrent operations cheaply and consistently within a SINGLE THREAD."

Apparently you can set it up to use Dask to escape the single thread, but that is kind of a global config. With Pypeline you can mix and match between Processes, Threads, and asyncio Tasks where it makes sense; resource management per stage is simple and explicit. If you have some understanding of the multiprocessing, threading, and asyncio modules, Pypeline will save you tons of time.

Still, I will keep an eye on Streamz; it's very nice work with lots of features, and it should get more visibility.


Pypeline doesn't seem that simple itself.


Why not? It seems to be a boilerplate remover for simple parallel processing tasks.


Maybe I'm just not quite able to get why "lightweight" is a thing. I also prefer Django over Flask for even the simplest of server software...


Similar to a library I've been working on as well: https://github.com/timkpaine/tributary


mpipe might also be of interest.

http://vmlaker.github.io/mpipe/


Thanks! I did take a look at mpipe (it's actually referenced in the README). But mpipe has its flaws:

1. It uses None as the stage terminator. This is VERY error-prone: what if you actually want to send None? Pypeline uses a special private terminator.

2. You have to manually put all the data into the pipe in a for-loop and then manually get it out. In Pypeline all this is simplified: it consumes iterables and all stages are iterables, so it's 100% compatible with any function/framework that accepts iterables.


3. If you have to put in all the data first, you can't handle streams. Pypeline accepts non-terminating iterables and also gives you back possibly non-terminating iterables.
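The private-terminator idea in point 1 can be sketched with a unique sentinel object compared by identity, so a user-sent None passes through untouched (this is a sketch of the pattern, not pypeln's actual implementation):

```python
import queue
import threading

_DONE = object()  # private sentinel: no user value can be identical to it

def producer(q, items):
    for item in items:
        q.put(item)
    q.put(_DONE)  # signal end-of-stream

q = queue.Queue()
t = threading.Thread(target=producer, args=(q, [1, None, 3]))
t.start()

out = []
while True:
    item = q.get()
    if item is _DONE:  # identity check, so a literal None is just data
        break
    out.append(item)
t.join()
# out == [1, None, 3]: None survived as a normal value
```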


Wow. It seems to save a lot of boilerplate code for ETL.


Looks similar to what tf.data does for TensorFlow.


Nice to hear that. When I wrote this:

"Pypeline was designed to solve simple medium data tasks that require concurrency and parallelism but where using frameworks like Spark or Dask feel exaggerated or unnatural."

it was actually because in the past I've resorted to hacking tf.data and Dask just to get concurrency and parallelism. Pypeline is way more natural for pure Python stuff.



