
For your feedback I'm offering karma or bug bounties (if you can identify me on GitHub)!

Forestry High Resolution Inventory - Predicting Forest Attributes at Landscape Scale
----------------------------------

This is a really promising business line. Our pipeline is not very automated and has no application-level data management - we're a ways away from that (e.g. the code isn't even under version control).

For our current business and data volumes, the system works.

We get target attributes (forest characteristics, e.g. species composition) by doing field surveys or other methods (e.g. expert photo interpretation).

We acquire lidar, color-infrared imagery, and climatic models to develop landscape-scale features. For the lidar-derived features we use LAStools. We use Safe Software's FME to generate some features from the color-infrared data (e.g. vegetation indices). For the climatic indices we use regional climate models suitable for the area of interest.
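As a rough illustration, the lidar feature step is mostly batched CLI runs; a minimal Python driver might look like the following (tool choice, paths, and flags are illustrative, not our actual setup - check the LAStools docs for exact options):

    import subprocess
    from pathlib import Path

    # Batch an LAStools canopy-metrics run over the tiled lidar.
    # Paths and flags here are placeholders, not our real layout.
    for tile in Path("lidar/tiles").glob("*.laz"):
        out = Path("features") / (tile.stem + ".csv")
        subprocess.run(["lascanopy", "-i", str(tile), "-o", str(out)], check=True)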

The reference data and the target attributes are spatially joined. We end up with a lot of features and use the subselect library's "improve" function in R in an iterative fashion to reduce the number of features; leave-one-out LDA is used to assess the performance of each feature subset. If the target variable is not categorical, we have to bin it into classes in order to run the LDA. The procedure produces a lot of candidate feature sets; our process for selecting a particular feature set is human-driven - we do not have a formalized or explicit rule. The chosen feature subset is then used in a kNN routine. Some components are in Python and state is shared through the filesystem.
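For a sense of the shape of this step, here is a rough Python analogue using sklearn's forward selection in place of subselect's "improve" (synthetic placeholder data; our real routine is the R workflow described above):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Placeholder data standing in for the spatially joined plot table:
    # 120 plots, 60 landscape features, 4 (binned) target classes.
    X = np.random.rand(120, 60)
    y = np.random.randint(0, 4, 120)

    # Greedy forward selection scored by leave-one-out LDA accuracy,
    # standing in for subselect::improve plus our LOO LDA assessment.
    selector = SequentialFeatureSelector(
        LinearDiscriminantAnalysis(),
        n_features_to_select=10,
        cv=LeaveOneOut(),
        n_jobs=-1,
    )
    X_sub = selector.fit_transform(X, y)

    # The chosen subset then feeds the kNN prediction step.
    knn = KNeighborsClassifier(n_neighbors=5)
    print(cross_val_score(knn, X_sub, y, cv=LeaveOneOut()).mean())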

We do a lot of tuning on a project-by-project basis.

After prediction, several transformations derive other forest characteristics and munge the data into regional standards.
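A trivial sketch of that munging stage (the column names and the derived formula are made up for illustration, not our actual schema):

    import pandas as pd

    # Hypothetical kNN output; columns and the derivation below are
    # illustrative only.
    preds = pd.read_csv("predictions.csv")
    preds["stems_per_ha"] = preds["stem_count"] / preds["plot_area_ha"]

    # Rename to the regional reporting standard and write the deliverable.
    preds.rename(columns={"species_comp": "SPECIES_COMPOSITION"}).to_csv(
        "deliverable.csv", index=False
    )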

The whole modeling process runs on a somewhat high-powered desktop. One directory holds all the code and configs; scripts are invoked and pull their settings from a corresponding config. Some of the code is stateful (i.e. it stores configuration) and the configs are global (their location is not abstracted), so to run the process concurrently it has to be deployed on separate machines.
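The fix we keep circling is to pass the config location in explicitly instead of hard-coding it, roughly like this (run_pipeline is a placeholder standing in for an existing script body):

    import argparse
    import json

    def run_pipeline(settings):
        # placeholder for the existing script body
        print("running with", settings)

    # Taking the config path as an argument, rather than reading a
    # global location, would let two projects share one machine.
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    args = parser.parse_args()

    with open(args.config) as f:
        settings = json.load(f)

    run_pipeline(settings)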

Municipal Sewer Backup
----------------------

We ported some of the components described above to predict sewer backup risk. Key components were broken out into R and Python libraries and dockerized: pylearn (https://github.com/tesera/pylearn) and rlearn (https://github.com/tesera/rlearn).

A Python library (learn-cli, which we haven't open sourced) uses rpy2 to coordinate and share state between the two libraries. The training process still requires a user to select a model from all the candidates that the variable selection routine produces. This selection is made at prediction time: all the candidate models are stored in one file and an id specifies which one to use. learn-cli is dockerized and we have it deployed on ECS. It scales pretty well.
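Conceptually the prediction side of the bridge looks something like this (a stripped-down sketch; the file names and model id are illustrative, since learn-cli isn't open source):

    import rpy2.robjects as ro

    # All candidate models live in one serialized R file; a
    # user-supplied id picks one out at prediction time.
    candidates = ro.r["readRDS"]("candidates.rds")
    model = candidates.rx2("lda_07")  # illustrative id

    newdata = ro.r["read.csv"]("features.csv")
    preds = ro.r["predict"](model, newdata=newdata)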

This solved many of the challenges in the forestry pipeline, but we haven't been able to bring everything from the forestry model into deployment this way due to a gap between our data science and development capabilities. I've been looking into Azure Machine Learning as a possible solution. I have benchmarked some built-in models there and gotten performance identical to our highly customized process.

-----------------------------------

Would love to hear your advice on formalizing and automating our forestry model pipeline, or on alternatives to it.

Also, if you have a highly automated machine learning pipeline - what are your data scientists' responsibilities? It's not clear to me how our data scientists' jobs would evolve if they didn't have to manually run several scripts to fit and select a model and generate the predictions.




Wow, that sure looks like it could be automated, starting with the indexing passes (script them in Python maybe?) and slowly factoring bits of it out into sklearn modules or whatever. Alternatively, as a batch job spawned from within R, but that would get annoying over time. Still, the idea is great and it sounds like you guys end up with time to go kayaking etc.
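Something in the direction of (untested, names made up):

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier

    # stand-in data for your joined plot table
    X_train = np.random.rand(100, 8)
    y_train = np.random.randint(0, 3, 100)

    # one object per run instead of a pile of scripts + configs
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ])
    pipe.fit(X_train, y_train)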

I'd think that Microsoft would be bending over backwards to get you to run the R stuff on Azure, since they bought the Revolution operation to show off such things. In the long term I imagine that you'll move most of this to Python.

You might consider poaching some spatial statistics people from Uber. Like, say, some of the women. At least if the market is big and robust enough to bring in some heavy hitters. It seems like a lot of the steps might be automated with convolutions and validated against your expert pool, but I can't say for sure.

Neat projects. Best of luck.


I'd love to hear more about what you're doing. I'm looking a bit into this space and could share some of what I've seen in the market - my email is in my profile.


FYI: HN email field isn't public; you have to put it in the about field too.


Glad to chat if you throw your email into your profile.



