I work with a Hadoop cluster of about 100 machines and manage a fairly complex pipeline of jobs scheduled in Control-M. There are roughly 50 jobs that are linked to each other through dependencies. Some of the data sets we read in are 20 billion records. We don’t have a proper test or dev environment for any of this because we lack the resources to replicate all of our data. I’m looking for advice on how to approach setting up another environment.

One idea I was looking into was integrating Spring Cloud Config Server with Spark, so that I could simply swap which config file is used when ordering a job in Control-M. The only problem I’ve run into is that linked jobs can trigger early, i.e. I could run jobA with the TEST configuration, but this causes jobB to run with the PROD config because its condition was met. Unfortunately, there is no way around this without using Control-M jobs as code and its config transformer.
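To make the config-swapping idea concrete, here is a minimal sketch in Python of how a job launcher might resolve its settings for a given environment. It assumes the Config Server's usual `/{application}/{profile}` endpoint and a JSON response whose `propertySources` list is ordered from highest to lowest precedence. The helper names `config_url` and `flatten_config`, the host, and the property keys are all hypothetical and only for illustration.

```python
from typing import Any, Dict


def config_url(base_url: str, application: str, profile: str) -> str:
    """Build the environment URL for one application and profile.

    Assumes the standard /{application}/{profile} layout.
    """
    return f"{base_url.rstrip('/')}/{application}/{profile}"


def flatten_config(response: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a Config Server response into one flat dict of properties.

    Assumption: earlier entries in "propertySources" take precedence,
    so we apply them last by walking the list in reverse.
    """
    merged: Dict[str, Any] = {}
    for source in reversed(response.get("propertySources", [])):
        merged.update(source.get("source", {}))
    return merged


if __name__ == "__main__":
    # Hypothetical response for jobA with the "test" profile.
    sample = {
        "name": "jobA",
        "profiles": ["test"],
        "propertySources": [
            {"name": "jobA-test.yml", "source": {"input.path": "/data/test/in"}},
            {"name": "jobA.yml", "source": {"input.path": "/data/prod/in"}},
        ],
    }
    print(config_url("http://config-host:8888", "jobA", "test"))
    print(flatten_config(sample))
```

Note that this only changes which settings a single job sees. It does nothing about the scheduler conditions, so jobB would still be triggered by jobA's condition regardless of the profile.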

Another option was using the Control-M API to export the jobs as code and apply a transform file that rewrites all of the conditions so that they don’t collide with the production ones. I could set up a Jenkins pipeline that promotes the code to Control-M on each build, which would help keep things in sync. The only issue I have with this is that it ties our setup more tightly to Control-M, and future plans seem to point to us moving away from Control-M and into AWS.
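The condition rewrite could look roughly like the sketch below. This is not Control-M's own transform mechanism, just a standalone Python illustration of the idea: walk the exported JSON and prefix every condition name with an environment tag. The JSON layout (an `Event` key inside `eventsToAdd` / `eventsToWaitFor`) is my assumption about the export format and should be checked against a real export. The function name `namespace_events` and the `TEST_` prefix are hypothetical.

```python
import json
from typing import Any


def namespace_events(node: Any, prefix: str) -> Any:
    """Return a copy of the job definition with every "Event" name prefixed.

    Assumption: condition names live under the key "Event" as strings.
    The input is not modified.
    """
    if isinstance(node, dict):
        return {
            key: prefix + value
            if key == "Event" and isinstance(value, str)
            else namespace_events(value, prefix)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [namespace_events(item, prefix) for item in node]
    return node


if __name__ == "__main__":
    # Hypothetical two-job export: jobB waits on the condition jobA adds.
    exported = {
        "PipelineFolder": {
            "Type": "Folder",
            "jobA": {
                "Type": "Job:Command",
                "Command": "run_a.sh",
                "eventsToAdd": {"Type": "AddEvents", "Events": [{"Event": "jobA-done"}]},
            },
            "jobB": {
                "Type": "Job:Command",
                "Command": "run_b.sh",
                "eventsToWaitFor": {"Type": "WaitForEvents", "Events": [{"Event": "jobA-done"}]},
            },
        }
    }
    print(json.dumps(namespace_events(exported, "TEST_"), indent=2))
```

Because both the adding and the waiting side get the same prefix, the TEST copy of jobB waits only on the TEST copy of jobA, and the production conditions are left alone. Folder and job names would presumably need the same treatment to avoid overwriting the production definitions.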

A third option was combining both: jobs as code in Control-M together with Spring Cloud Config, to isolate the configuration files from the job code itself.

This still doesn’t solve the issue of how we could possibly replicate our data. Some of our jobs take third-party data as inputs, and replicating another flow of jobs for TEST might cause a race condition where multiple jobs try to read from the same input files at the same time. I don’t know how complex or bad the scheduling conflicts can get.

Does anyone have experience in this area and could provide some pointers? Or maybe suggest an alternative technology?

submitted by /u/Igneous001