Job Objectives
Design and deliver scalable real-time data and machine learning solutions by building robust ingestion and transformation frameworks across Hadoop ecosystems. Enable end-to-end ML model operationalization and performance optimization, while supporting multi-modal data processing and development of engineering tools and applications.
Key Responsibilities
- Design and develop highly scalable, real-time systems using Hadoop ecosystem components (Iceberg, Spark, Ozone, Trino, Hive, Ranger, Kafka, Flink, and NiFi).
- Build robust data ingestion and transformation frameworks using Java, Spark, Python, and shell scripting to ingest multi-modal data (image, audio, video, unstructured documents) in both batch and real-time modes.
- Develop full‑stack applications and internal engineering tools using Python, shell scripting, and modern web frameworks (e.g., Flask, React).
- Collaborate closely with data scientists to operationalize machine learning models using Cloudera Machine Learning (CML).
- Tune and optimize the performance of data applications on Hadoop to ensure efficient resource utilization.
Skillset
- Experience working with ML platforms such as CML, Spark MLlib, and Python ML libraries (scikit‑learn, XGBoost), including model deployment.
- Bachelor's or Master's degree in Computer Science, Engineering, Information Technology, or a related field.
- Minimum of 6 years of professional experience.
Key Skills
- Experience with Python, Java, Scala, or C++
- ML frameworks and libraries: XGBoost, scikit-learn, TensorFlow/Keras, Hugging Face (NLP/NLQ/Gen AI use cases)
- Full-stack development
- Performance optimization
- Data engineering and ingestion frameworks
- Collaboration with data science teams