Responsibilities:
• Setting up infrastructure and running large-scale data processing with Spark, Hive, Hadoop, etc. on AWS
• Assisting in sourcing, designing, integrating, and managing large, complex geospatial datasets using APIs and AWS cloud tools (EC2, S3, SageMaker)
• Building and supporting production-quality Airflow data pipelines
• Processing datasets on Google Earth Engine
• Maintaining continuous integration and deployment workflows (CI/CD, GitHub)
• Preparing and maintaining technical documentation for datasets and deployed models (metadata, data dictionaries, code annotation, process diagrams)
• Performing exploratory data analysis, visualizing information using existing and new tools (QGIS, Python, R, etc.)
• Developing testing frameworks and tests for database and model quality and performance
Required Qualifications:
• 5+ years of experience with Dask, Ray, GDAL, or Kubernetes
• 5+ years of experience with at least one large-scale distributed data processing framework (Spark, Hadoop, Hive, etc.)
• 5+ years of experience creating and deploying data pipelines on AWS
• 5+ years of experience with AWS (Lambda, EC2, S3, managed Airflow, SageMaker, VPCs)
• Degree in Computer Science, Data Science, Engineering, Geography, Remote Sensing, or another highly quantitative discipline
• High level of proficiency in Python or R
• Ability to collaborate and communicate with both technical and non-technical stakeholders
• Ability to work independently and make key infrastructure decisions in a startup environment
• Experience in geospatial data, remote sensing techniques, satellite imagery, or computer vision
• Experience working with large raster or image datasets