Not a fan of this. You save a couple of chars and massively decrease legibility. Anyone new to the project will find it way harder to understand. Instead of abbreviation hell, just spell it out and move on. Not worth saving a couple of characters.
True masters of naming make sure to use variable names that are long but unambiguous after the first several characters, so that with tab-completion you get the best of both worlds.
Point taken! Thanks a lot for your feedback.
Just a few points:
* pypeline is already taken :(
* My main reason for this was that initially I was thinking you would do `import pypeln as pl` and then call things like e.g. `pl.pr.map`; since you can't abbreviate the modules inside `pl`, I picked short names. But then I decided to go for an import-the-module kind of strategy.
I am thinking about expanding the module names to their worker names:
* pr --> process
* th --> thread
* io --> task
And then have the conventions
* from pypeln import process as pr
* from pypeln import thread as th
* from pypeln import task as io # as ta?
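If I go with that, both styles could coexist, roughly like this (a quick sketch assuming the renamed modules keep a `map` function like the current `pr.map`; `slow_fn` and `data` are just placeholders):

```python
# Library exports descriptive names...
from pypeln import process

results = process.map(slow_fn, data)   # clear at the call site

# ...and users who prefer brevity can alias on import:
from pypeln import process as pr

results = pr.map(slow_fn, data)
```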
This conversation is very valuable, thank you all for the feedback.
Hi, just coming back to say congrats on your library and I wish you all the best with it :)
I see your reasoning here (`import pypeln as pl`) but I still think where you have submodules you should use unabbreviated words for their names.
For me I'd be happy with `pl.process.map` in my code, but `pl.pr.map` feels a bit too obscure to have as the default.
These things are quite subjective of course, but part of that subjective judgement comes from the experience of what is commonly done in other Python libraries (the stdlib is a bit of a mixed bag in this regard unfortunately, riddled with CamelCase and other abominations).
There are a lot of "pipe" projects on PyPI, but your project is also about process management, so maybe you should avoid "pipe" in the name? FlowProcessor? nFlow? xFlow?
I do agree that you should avoid `io` for asyncio. You should probably at least use `aio`, but there's no reason you can't have `asyncio_task`, `thread_task`, `multiprocessing_task`.
Lastly, in my mind the killer app for this would be to allow something that works on top of Celery in production, but then be able to fall back to say multiprocessing or threading when running locally. That would allow me to prototype something, and then when I want to scale, I can just change a config setting.
Hey, thanks for all the feedback. I will change the naming since it's something most of you have agreed is a good change.
The goal I have for Pypeline is much simpler: let you easily set up data pipelines that leverage processes, threads, and asyncio where each is at its best. So in my mind a killer app would be a pipeline that maybe starts with an asyncio stage for e.g. downloading images, then a multiprocessing stage for e.g. doing image processing, and finally a threading stage for e.g. interacting with the OS.
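A rough sketch of that kind of pipeline, assuming `map`-style stage functions and a per-stage `workers` argument (the `download_image`, `process_image`, and `save_image` helpers are hypothetical):

```python
from pypeln import io, pr, th  # current short module names (see the naming discussion above)

urls = ["http://example.com/img/{}.jpg".format(i) for i in range(100)]

stage = io.map(download_image, urls, workers=50)   # asyncio: many concurrent downloads
stage = pr.map(process_image, stage, workers=4)    # processes: CPU-bound image work
stage = th.map(save_image, stage, workers=8)       # threads: blocking OS / filesystem calls

list(stage)  # stages are iterables; consuming one drives the whole pipeline
```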
Right now I see Pypeline more as an easy-to-use single-machine tool than as a higher-level distributed abstraction like Celery. Maybe other frameworks could leverage Pypeline to ease their work.
> My main reason for this was that initially I was thinking you would do `import pypeln as pl` and then call things like e.g. `pl.pr.map`; since you can't abbreviate the modules inside `pl`, I picked short names
"Assumptions are the root of all evil."
With autocomplete a coder has no reason to use short names anyway.
At the risk of speaking from ignorance (I don't really know Python), isn't this succinctness an aim of Python? A lot of Python enthusiasts I speak with decry the verbosity of Java and point to how much less code it takes to do the same thing in Python.
Python has a famous adage: `explicit is better than implicit`.
You want to be explicit but still concise while writing Python.
I do not have any experience with Java, but I guess when writing Java you can feel that you use more words than needed (i.e. the definition of verbose [0]); Wikipedia has a hello world example [1] and it just feels heavy.
IMHO if you write Pythonic code, very often it feels like writing/reading prose, like talking to the computer. You are neither too implicit nor too verbose.
It's really more about readability. In this case `threads_pipeline` is much easier to understand than `th`. Descriptive names also mean needing fewer comments.
The succinctness of Python (such as it is) comes from the syntax itself. On the other hand, obscure variable names are discouraged. It's not a 'code golf' language. The aim is clarity and readability, unobscured by either verbose boilerplate or excessive brevity.
A library author should provide names which are descriptive and clear. Users can then abbreviate them however much or little as we choose (by import aliasing). For example it is very common to see `import numpy as np`... but you wouldn't want them to publish the library as `np`. It should have its proper name.
One reason for this is if I'm exploring code in a REPL or IDE with tab-completion. You want to have some idea what a module is for, without having to play 'guess the abbreviation'.
And it's especially unnecessary since Python already supports abbreviation during import. It could just as easily be "from pypeline import process as pr" or something similar.
Snakemake [0] is a tool worth checking out. You can use it to create declarative workflows, and similar to make, it creates a DAG of dependencies when you give it your desired output. Each rule can specify how many threads it needs and other arbitrary resources and the scheduler uses that to constrain execution. Workflows are architecture independent - you should be able to execute a snakemake workflow on a laptop, in the cloud, or on an HPC cluster.
It also allows you to use UNIX pipes with your dependent jobs when that is appropriate [1].
Pypeline is a library you use in your code, while Bonobo seems to be a framework that uses your code.
I tend to think that you lose flexibility with the latter.
Pypeline was designed to solve simple medium data tasks that require concurrency and parallelism but where using frameworks like Spark or Dask feels exaggerated or unnatural.
This is exactly what I was looking for very recently. Thank you for writing this, I'll certainly look into it.
What about Apache Beam? Getting started with the Python SDK has been very easy IMHO. Also, you are future-proof as you can easily switch the runner from local to Dataflow/Flink/...
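For reference, a minimal Beam pipeline is only a few lines; it runs on the local DirectRunner by default, and switching runners is mostly a matter of pipeline options rather than code changes:

```python
import apache_beam as beam

# Runs on the local DirectRunner unless other pipeline options are given.
with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([1, 2, 3, 4])
        | "Square" >> beam.Map(lambda x: x * x)
        | "Print" >> beam.Map(print)
    )
```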
I use Beam for my Dataflow jobs, but their local "DirectRunner" is just for testing purposes. As with Spark, Beam is a huge beast. Pypeline was created with simplicity in mind: it's a pure Python library with no dependencies.
Also, whenever these conversations about flow-based / pipelining tools come up, I always like to point people to the Common Workflow Language, to remind people that there is an attempt at standardizing workflow descriptions so that they can be used with different packages:
None of these frameworks (there are many) seem to have support for repeating a certain target multiple times, with different arguments. For example, say you have a data set with per-country data; how do you repeat the same analysis on each country? This simple example is easy with a loop, but when you have multiple dimensions like this, you want to call each target with all possible permutations, depending on which type of dimension is actually relevant for that target. Does any ETL framework support that?
(I was actually just writing a spec this afternoon for a new tool that does exactly this, because I can't find anything suitable.)
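For what it's worth, the core of the idea fits in a few lines of plain Python: each target declares which dimensions it cares about, and you call it once per permutation of those dimensions (all names here are hypothetical):

```python
from itertools import product

# Hypothetical dimensions for a per-country data set.
DIMENSIONS = {
    "country": ["US", "DE", "JP"],
    "year": [2016, 2017, 2018],
}

def run_target(target, relevant_dims):
    """Call `target` once for every permutation of its relevant dimensions."""
    keys = list(relevant_dims)
    for values in product(*(DIMENSIONS[k] for k in keys)):
        target(**dict(zip(keys, values)))

def per_country_report(country):            # only varies by country
    print("report for", country)

def per_country_year_stats(country, year):  # varies by country and year
    print("stats for", country, year)

run_target(per_country_report, ["country"])
run_target(per_country_year_stats, ["country", "year"])
```

The loop itself is the easy part, as noted above; what the frameworks seem to lack is treating each permutation as its own target in the dependency graph.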
Dask might be lightweight internally but resorting to it just to solve a simple task that requires concurrency is not "simple".
Streamz looks nice! However:
"Streamz relies on the Tornado framework for concurrency. This allows us to handle many concurrent operations cheaply and consistently within a SINGLE THREAD."
Apparently you can set it up to use Dask to escape the single thread, but that is kind of a global config. With Pypeline you can mix and match Processes, Threads, and asyncio.Tasks where each makes sense, and resource management per stage is simple and explicit. If you have some understanding of the multiprocessing, threading, and asyncio modules, Pypeline will save you tons of time.
Still, I will keep an eye on Streamz; it's very nice work with lots of features, and it should get more visibility.
Thanks! I did take a look at mpipe (it's actually referenced in the README). But mpipe has its flaws:
1. It uses None as the stage terminator, which is VERY error prone: what if you actually want to send None? Pypeline uses a special private terminator.
2. You have to manually put all the data into the pipe in a for-loop and then manually get it out. In Pypeline all this is simplified: it consumes iterables and all stages are iterables, so it's 100% compatible with any function/framework that accepts iterables.
3. If you have to put in all the data first, you can't handle streams. Pypeline accepts non-terminating iterables and also gives you back possibly non-terminating iterables.
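To illustrate point 1, here is a minimal sketch of the private-terminator idea using a plain thread and queue (not Pypeline's actual implementation, just the concept):

```python
import queue
import threading

_DONE = object()  # unique private sentinel; unlike None it can never collide with real data

def produce(items, q):
    for item in items:   # None here is just another value
        q.put(item)
    q.put(_DONE)         # unambiguous end-of-stream marker

def consume(q):
    while True:
        item = q.get()
        if item is _DONE:
            break
        yield item

q = queue.Queue(maxsize=8)
threading.Thread(target=produce, args=([1, None, 3], q), daemon=True).start()
print(list(consume(q)))  # [1, None, 3] -- None passes through safely
```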
"Pypeline was designed to solve simple medium data tasks that require concurrency and parallelism but where using frameworks like Spark or Dask feel exaggerated or unnatural."
It was actually because I've resorted to hacking tf.data and Dask in the past just to get concurrency and parallelism. Pypeline is way more natural for pure Python stuff.
pypeline --> pypeln
multiprocessing pipeline --> pr
threads pipeline --> th
asyncio pipeline --> io
This is totally unnecessary.
If I want to use short abbreviated names in my code, I can always do `from pypeline import multiprocess_pipeline as pr`.
Your library shouldn't export them like this as the default.
`io` is especially bad since it overshadows the `io` module in the Python stdlib.