‘Data scientist’ has been a hot role for a while now. In fact, many organizations front-load their teams with analysts and data scientists before they know what problems they want to work on or have set up their data pipelines.
But before you can get value out of your analysts, you need to make sure that data is flowing in your organization.
How do you make data flow?
Step 1: Collect, clean and store
You must record all important data from the apps and systems you own. If you rely on external data, you need to collect, clean and store that as well.
Each of these stages may involve automated or semi-automated collection, cleaning for basic issues, and then a data load step.
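To make that concrete, here is a minimal collect-clean-load sketch in Python using pandas and SQLAlchemy. The source URL, column names, table name and connection string are placeholders for illustration only:

```python
# A minimal collect-clean-load sketch. The URL, columns and table name
# are hypothetical -- substitute your own sources and schema.
import pandas as pd
from sqlalchemy import create_engine

# Collect: pull a CSV published by an external source (placeholder URL)
raw = pd.read_csv("https://example.com/hourly_readings.csv")

# Clean: fix basic issues -- parse timestamps, drop duplicates,
# discard rows with missing readings
raw["timestamp"] = pd.to_datetime(raw["timestamp"], errors="coerce")
clean = raw.dropna(subset=["timestamp", "value"]).drop_duplicates()

# Load: write the cleaned table into a shared postgres database
engine = create_engine("postgresql://user:password@dbhost:5432/analytics")
clean.to_sql("hourly_readings", engine, if_exists="append", index=False)
```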
The data store can just be files on your PC for your individual projects. But this means your colleagues can’t easily access data that you have collected.
Using a server-based database management system like postgres or MySQL lets many people connect to and use the data, and makes it easier to make data flow outward.
There are also cloud-based serverless technologies like Google BigQuery that can act as your data store.
If you work with geospatial datasets, postgres (with the postgis extension) and BigQuery are good options to start with. Many geospatial tasks, like computing the distance of a point from a line, checking whether a point lies inside a polygon, or finding the nearest landmark, can be handled with simple queries.
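As a sketch of what such a query can look like from Python, here is a hypothetical “nearest landmark” lookup against postgres with postgis, run through SQLAlchemy. The table and column names (landmarks, name, geom) and the connection string are made up for illustration:

```python
# Find the five landmarks nearest to a point, with distances in metres.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@dbhost:5432/analytics")

query = text("""
    SELECT name,
           ST_Distance(geom::geography,
                       ST_SetSRID(ST_MakePoint(:lon, :lat), 4326)::geography) AS metres
    FROM landmarks
    ORDER BY geom <-> ST_SetSRID(ST_MakePoint(:lon, :lat), 4326)
    LIMIT 5
""")

with engine.connect() as conn:
    for name, metres in conn.execute(query, {"lon": 77.59, "lat": 12.97}):
        print(f"{name}: {metres:.0f} m away")
```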
But handling multi-dimensional raster datasets (like weather model data) is more challenging. If your final purpose is to display the data on a website, a tile-based model with GeoTIFFs can work well. If you want to perform analysis, however, netCDF has been the go-to format for most experts.
But netCDF files have their limitations: for one, you have to do the file management yourself.
The pangeo project has been trying to solve this by harnessing the cloud, with an ecosystem built on xarray, dask and zarr that can store practically unlimited raster data and throw the necessary compute power at processing it.
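A minimal sketch of that workflow, assuming a hypothetical netCDF file and variable name, might look like this:

```python
# Open a netCDF file lazily with dask chunks, express a computation,
# and write the data out as a cloud-friendly zarr store.
# The file name, variable name and chunk size are illustrative only.
import xarray as xr

# Nothing is loaded into memory until it is actually needed
ds = xr.open_dataset("weather_model_output.nc", chunks={"time": 24})

# Computations are expressed up front and executed in parallel by dask
monthly_mean = ds["temperature"].groupby("time.month").mean()

# zarr stores are chunked directories that object storage (S3, GCS)
# can serve directly to many workers at once
ds.to_zarr("weather_model_output.zarr", mode="w")
```

Because a zarr store is just a collection of chunks, many dask workers can read and process it in parallel without a central file server, which is what makes the “practically unlimited” scale possible.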
The technology stack for your data store should depend on the desired functionality, the number of users, the type and size of your data, and how users will want to work with it.
Step 2: Reach the right audience in the right way
Some of your people might want to dig deep into the data and build models using statistical tools. Others might just want to plot a trendline without having to write code.
You may want to showcase some visualizations on your website, or feed the data into another system that needs it to produce its own outputs.
For analysts and managers who need no-code tools, there are now many options: PowerBI, Tableau, Metabase, Google Data Studio, Kibana and others. Most of these tools can connect to several databases.
Apache Superset is an emerging open source option that is very powerful and can support large teams of collaborating analysts.
You may also need to build APIs or custom pipelines to connect the data to code and to other systems. The advantage of an API is that you don’t have to build separate linkages to each target system: any system that needs the data can query the API directly.
But sometimes the API may be limited in the amount of data or the level of detail it exposes. In such situations you can use a direct database connection, using packages like SQLAlchemy for Python.
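As an illustration, here are both routes side by side. The endpoint, parameters, credentials and table names are placeholders:

```python
# Two ways analysts' code can pull data: via an API, or straight
# from the database when the API is too limited.
import pandas as pd
import requests
from sqlalchemy import create_engine

# 1. Through an API -- any system can hit the same endpoint
resp = requests.get(
    "https://api.example.com/v1/readings",
    params={"station": "BLR-01", "from": "2023-01-01"},
)
readings = pd.DataFrame(resp.json())

# 2. Through a direct database connection, for bulk or detailed pulls
engine = create_engine("postgresql://user:password@dbhost:5432/analytics")
readings = pd.read_sql(
    "SELECT * FROM hourly_readings WHERE station = 'BLR-01'", engine
)
```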
Most ML platforms also ship with connectors for several databases, so you can figure out which ones work for you.
Installing all the necessary software and packages can be daunting for many analysts. There are managed environments that can simplify this.
A managed coding environment in the cloud like Google Colab is easy to bring up and scales with your requirements. The JupyterHub project enables all of this in the open-source world. You can preinstall the necessary packages for your users, and they can log in via the browser and turn on the data tap!
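For example, a minimal jupyterhub_config.py sketch, assuming you run JupyterHub with DockerSpawner and a hypothetical image that has your packages baked in, could look like this:

```python
# jupyterhub_config.py -- a minimal sketch. The image name and user
# list are placeholders; bake pandas, xarray, sqlalchemy, etc. into
# the image so every analyst gets the same environment.
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "myorg/analyst-env:latest"   # preinstalled packages

# Start users in JupyterLab rather than the classic notebook UI
c.Spawner.default_url = "/lab"

# "dummy" is for local testing only; swap in a real authenticator
# (OAuth, LDAP, native accounts) before giving access to your team
c.JupyterHub.authenticator_class = "dummy"
c.Authenticator.allowed_users = {"asha", "ravi", "meera"}
```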
This lets you complete the picture!
The industry term for making data flow is ‘data engineering’. You have to make sure you have the data engineering resources and infrastructure in place before you can extract the full potential of the data you have!
Earthmetry provides datasets and hosted environments that enable your analysts to work with important data in energy, air pollution, climate and other related areas.
We don’t just give you the data. We can make it flow to the right audience in the right way!