Pypeline: A Python library for creating concurrent data pipelines (github.com/cgarciae)
121 points by cgarciae on Sept 24, 2018 | hide | past | favorite | 46 comments


Too much abbreviation!

pypeline --> pypeln

multiprocessing pipeline --> pr

threads pipeline --> th

asyncio pipeline --> io

this is totally unnecessary

If I want to use short abbreviated names in my code I can always `from pypeline import multiprocess_pipeline as pr`

Your library shouldn't export them like this as the default.

`io` is especially bad, since it shadows the `io` module in the Python stdlib.


Not a fan of this. You save a couple of characters and massively decrease legibility; anyone new to the project will find it far harder to understand. Instead of abbreviation hell, just spell it out and move on.


True masters of naming make sure to use variable names that are long but unambiguous after the first several characters, so that with tab-completion you get the best of both worlds.


Point taken! Thanks a lot for your feedback. Just a few points:

* `pypeline` is already taken :(

* My main reason for this was that initially I thought you would do `import pypeln as pl` and then call things like `pl.pr.map`. Since you can't alias the modules inside `pl`, I picked short names, but then I decided to go for an import-the-module kind of strategy.

I am thinking about expanding the module names to their worker names:

* pr --> process

* th --> thread

* io --> task

And then have the conventions:

* from pypeln import process as pr

* from pypeln import thread as th

* from pypeln import task as io # as ta?

This conversation is very valuable, thank you all for the feedback.


Hi, just coming back to say congrats on your library and I wish you all the best with it :)

I see your reasoning here (`import pypeln as pl`) but I still think where you have submodules you should use unabbreviated words for their names.

For me I'd be happy with `pl.process.map` in my code, but `pl.pr.map` feels a bit too obscure to have as the default.

These things are quite subjective of course, but part of that subjective judgement comes from the experience of what is commonly done in other Python libraries (the stdlib is a bit of a mixed bag in this regard unfortunately, riddled with CamelCase and other abominations).


First of all, it's super cool. :)

There are a lot of "pipe" projects on PyPI, but your project is also about process management. Maybe you should avoid "pipe" in the name? FlowProcessor? nFlow? xFlow?

I do agree that you should avoid io for asyncio. You should probably at least use aio, but there's no reason you can't have asyncio_task, thread_task, multiprocessing_task.

Lastly, in my mind the killer app for this would be to allow something that works on top of Celery in production, but then be able to fall back to say multiprocessing or threading when running locally. That would allow me to prototype something, and then when I want to scale, I can just change a config setting.


Hey, thanks for all the feedback. I will change the naming, since it's something most of you have agreed is a good change.

The goal I have for Pypeline is much simpler: let you easily set up data pipelines that leverage processes, threads, and asyncio for what each is good at. So in my mind a killer app would be a pipeline that starts with an asyncio stage for e.g. downloading images, then a multiprocess stage for e.g. doing image processing, and finally a threading stage for e.g. interacting with the OS.
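A rough stdlib-only sketch of that kind of staged pipeline (this is not pypeln's actual API; thread pools stand in for all stage types here, and `download`/`process` are made-up placeholders for an I/O-bound and a CPU-bound step):

```python
from concurrent.futures import ThreadPoolExecutor

def download(url):   # placeholder for an I/O-bound stage (asyncio in pypeln)
    return f"bytes-of-{url}"

def process(blob):   # placeholder for a CPU-bound stage (processes in pypeln)
    return blob.upper()

urls = ["a.png", "b.png"]

# Each stage gets its own worker pool; pool.map preserves input order,
# and the second stage consumes the first stage's results lazily.
with ThreadPoolExecutor(max_workers=4) as io_pool, \
     ThreadPoolExecutor(max_workers=2) as cpu_pool:
    blobs = io_pool.map(download, urls)
    results = list(cpu_pool.map(process, blobs))
```

For a real CPU-bound stage you would swap the second pool for a `ProcessPoolExecutor`, which requires the worker function to be picklable and defined at module top level.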

Right now I see Pypeline more as an easy-to-use single-machine tool rather than a higher-level distributed abstraction like Celery. Maybe other frameworks could leverage Pypeline to ease their work.


So then I would want to have one more stage... a celery stage for when you want to cluster work across multiple machines. :)


> My main reason for this was because initially I was thinking that you did an `import pypeln as pl` and then called things like e.g. `pl.pr.map` since you cant abbreviate the module inside `pl` then I picked short names

"Assumptions are the root of all evil."

With autocomplete, a coder has no reason to use short names anyway.


"import pypeln as pl" could cause quite a bit of confusion in Poland I would imagine.


And maybe perl...


hahahaha


At the risk of speaking from ignorance (I don't really know Python), isn't this succinctness an aim of Python? A lot of Python enthusiasts I speak with decry the verbosity of Java and point to how much less code it takes to do the same thing in Python.


Python has a famous adage: "explicit is better than implicit".

You want to be explicit but still concise while writing python.

I do not have any experience with Java, but I guess when writing Java you can feel that you use more words than needed (i.e. the definition of verbose [0]). Wikipedia has a hello-world example [1] and it feels just heavy.

IMHO if you write Pythonic code, very often it feels like writing/reading prose, like talking to the computer. You are not too implicit, nor too verbose.

[0] https://www.merriam-webster.com/dictionary/verbose

[1] https://en.wikipedia.org/wiki/Java_(programming_language)#"H...


I'm guilty of this, because words are so meaningless to me. I'd rather have a single letter or glyph than seemingly clear yet ambiguous words.


It's really more about readability. In this case threads_pipeline is much easier to understand than th. Descriptive names also mean needing fewer comments.


The succinctness of Python (such as there is) comes from the syntax itself. On the other hand, obscure variable names are discouraged. It's not a 'code golf' language. The aim is clarity and readability - unobscured by either verbose boilerplate or excessive brevity.

A library author should provide names which are descriptive and clear. Users can then abbreviate them as much or as little as they choose (by import aliasing). For example, it is very common to see `import numpy as np`... but you wouldn't want the library published as `np`. It should have its proper name.

One reason for this is if I'm exploring code in a REPL or IDE with tab-completion. You want to have some idea what a module is for, without having to play 'guess the abbreviation'.


And it's especially unnecessary since Python already supports abbreviation during import. It could just as easily be "from pypeline import process as pr" or something similar.


Agreed. When it comes to naming things, I really like the way Russ Cox puts it: https://research.swtch.com/names


Absolutely agree with this.


Snakemake [0] is a tool worth checking out. You can use it to create declarative workflows, and similar to make, it creates a DAG of dependencies when you give it your desired output. Each rule can specify how many threads it needs and other arbitrary resources and the scheduler uses that to constrain execution. Workflows are architecture independent - you should be able to execute a snakemake workflow on a laptop, in the cloud, or on an HPC cluster.

It also allows you to use UNIX pipes with your dependent jobs when that is appropriate [1].

[0] https://snakemake.readthedocs.io/en/stable/index.html

[1] https://snakemake.readthedocs.io/en/stable/snakefiles/rules....


I wonder if you might compare this to Bonobo [https://www.bonobo-project.org/] which I think has similar design goals?


Pypeline is a library you use in your code, while Bonobo seems to be a framework that uses your code. I tend to think that you lose flexibility with the latter.


    Pypeline was designed to solve simple medium 
    data tasks that require concurrency 
    and parallelism but where using frameworks 
    like Spark or Dask feel exaggerated or unnatural.
This is exactly what I was looking for very recently. Thank you for writing this, I'll certainly look into it.


What about Apache Beam? Getting started with the Python SDK has been very easy IMHO. Also, you are future proof as you can easily switch runner from Local to Dataflow/Flink/...


I use Beam for my Dataflow jobs, but their local "DirectRunner" is just for testing purposes. And as with Spark, Beam is a huge beast. Pypeline was created with simplicity in mind: it's a pure Python library with no dependencies.


But not future proof in having to use python 2.7 unfortunately.


Seems like a good time to link to this curated list of pipeline toolkits (not all python).

https://github.com/pditommaso/awesome-pipeline/blob/master/R...


This too:

https://github.com/common-workflow-language/common-workflow-...

Also, whenever these conversations about flow-based / pipelining tools come up, I always like to point people to Common Workflow Language, to remind people that there is an attempt at standardizing workflow descriptions so that they can be used with different packages:

https://www.commonwl.org/


Speaking from my experience building similar pipelining and reverse-Polish function-application tooling in Python:

Piping using the | operator can make tracebacks pretty ugly with some operators.

If you want to keep the code somewhat 'pythonic' without introducing the syntax magic of |, then instead of writing:

  range(10)
  | pp.flatmap(lambda x: [x + 1, x + 2])
  | pp.map(lambda x: x * x)
  ...
You can do this instead:

  xs = range(10)
  xs = pp.flatmap(xs, lambda x: [x + 1, x + 2])
  xs = pp.map(xs, lambda x: x * x)
  ...
It helps to keep the operand as the first argument, instead of the last, because those lambdas are best kept at the end.

So instead of

  map(fn, xs)
do

  map(xs, fn)
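The operand-first convention the parent describes can be sketched with plain list comprehensions (`pmap`/`pflatmap` are hypothetical names, not pypeln's API):

```python
def pmap(xs, fn):        # operand first, callback last
    return [fn(x) for x in xs]

def pflatmap(xs, fn):    # each element maps to a list, then flatten
    return [y for x in xs for y in fn(x)]

xs = range(10)
xs = pflatmap(xs, lambda x: [x + 1, x + 2])
xs = pmap(xs, lambda x: x * x)
# xs is now [1, 4, 4, 9, 9, 16, ...]
```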


None of these frameworks (there are many) seem to have support for repeating a certain target multiple times, with different arguments. For example, say you have a data set with per-country data; how do you repeat the same analysis on each country? This simple example is easy with a loop, but when you have multiple dimensions like this, you want to call each target with all possible permutations, depending on which type of dimension is actually relevant for that target. Does any ETL framework support that?

(I was actually writing a spec this afternoon for a new tool that does exactly this, because I can't find anything suitable.)
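For what it's worth, the permutation part alone is easy to express with `itertools.product` (the dimension lists and the `analyze` target here are made up):

```python
from itertools import product

countries = ["usa", "canada", "mexico"]
years = [2016, 2017]

def analyze(country, year):  # stand-in for a per-slice analysis target
    return f"analysis.{country}.{year}"

# Run the same target for every combination of the relevant dimensions.
results = [analyze(c, y) for c, y in product(countries, years)]
```

The hard part the parent describes is orchestration: calling each target only with the dimensions relevant to it, which is where a workflow tool's wildcard/parameter machinery comes in.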


snakemake does this trivially:

    rule analyze_country:
        input: 'whatever.{country}.txt'
        output: 'analysis.{country}.txt'
        shell:
            'run-analysis-on-country {input} {output} --country=country'

    rule analyze_target_countries:
        input: ['analysis.usa.txt', 'analysis.canada.txt', 'analysis.mexico.txt']


Small change: you have to use wildcards.country inside the shell call:

    rule analyze_country:
        input: 'whatever.{country}.txt'
        output: 'analysis.{country}.txt'
        shell:
            'run-analysis-on-country {input} {output} --country={wildcards.country}'


Sounds like a group_by and then a function per group? Pypeline doesn't have grouping, but it sounds like Spark or Dask should do the job.


Dask is relatively lightweight actually, because it is pure Python.

Also, there is "Streamz" which solves a similar problem, seems more mature and can work with or without Dask or Dask-Distributed.


Dask might be lightweight internally but resorting to it just to solve a simple task that requires concurrency is not "simple".

Streamz looks nice! However:

"Streamz relies on the Tornado framework for concurrency. This allows us to handle many concurrent operations cheaply and consistently within a SINGLE THREAD."

Apparently you can set it up to use Dask to escape the single thread, but that is kind of a global config. With Pypeline you can mix and match between Processes, Threads, and asyncio Tasks where it makes sense; resource management per stage is simple and explicit. If you have some understanding of the multiprocessing, threading, and asyncio modules, Pypeline will save you tons of time.

Still, I will keep an eye on Streamz; it's very nice work with lots of features, and it should get more visibility.


Pypeline doesn't seem that simple itself.


Why not? It seems to be a boilerplate remover for simple parallel processing tasks.


Maybe I'm just not quite able to get why "lightweight" is a thing. I also prefer Django over Flask for even the simplest of server software...


Similar to a library I've been working on as well: https://github.com/timkpaine/tributary


mpipe might also be of interest.

http://vmlaker.github.io/mpipe/


Thanks! I did take a look at mpipe (it's actually referenced in the README). But mpipe has its flaws:

1. It uses None as the stage terminator. This is VERY error-prone: what if you actually want to send None? Pypeline uses a special private terminator.

2. You have to manually put all the data into the pipe in a for-loop and then manually get it out. In Pypeline all this is simplified: it consumes iterables and all stages are iterables, so it's 100% compatible with any function/framework that accepts iterables.


3. If you have to put in all the data first, you can't handle streams. Pypeline accepts non-terminating iterables and also gives you back possibly non-terminating iterables.
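The private-terminator idea in point 1 can be sketched with a unique sentinel object compared by identity, so a user-sent None passes through untouched (this is a sketch of the pattern, not pypeln's actual implementation):

```python
import queue
import threading

_DONE = object()  # private sentinel: no user value can be identical to it

def producer(q, items):
    for item in items:
        q.put(item)
    q.put(_DONE)  # signal end-of-stream

q = queue.Queue()
t = threading.Thread(target=producer, args=(q, [1, None, 3]))
t.start()

out = []
while True:
    item = q.get()
    if item is _DONE:  # identity check, so a literal None is just data
        break
    out.append(item)
t.join()
# out == [1, None, 3]: None survived as a normal value
```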


Wow. It seems to save a lot of boilerplate code for ETL.


Looks similar to what tf.data does for TensorFlow.


Nice to hear that. When I wrote this:

"Pypeline was designed to solve simple medium data tasks that require concurrency and parallelism but where using frameworks like Spark or Dask feel exaggerated or unnatural."

it was actually because in the past I've resorted to hacking tf.data and Dask just to get concurrency and parallelism. Pypeline is way more natural for pure Python stuff.



