Current State

With the current state of Fabric (and Data Factory and SQL Server Agent jobs), we almost always need to execute pipelines in sequences, which brings several disadvantages. In the worst case we end up with a monolithic parent pipeline that has to ensure every other pipeline runs in the required order. This complicates development and makes rolling changes out into existing environments harder. It also sometimes forces us (without a handcrafted job-control solution) to execute weekly or monthly jobs even when they are not needed. And even with a handcrafted job-control framework, we have to build patterns that wait for the successful execution of other pipelines or dataflows.

Cons of the Current State

From my point of view, there are several cons to this approach:

- No flexibility in dataflow design.
- Schedules are not objects in which multiple jobs can be started; they are more like a property tied to a single object.
- Unnecessary executions of pipelines just to keep them in sequence.
- Overhead of developing a job-control framework individually (we could actually sell ours :D).
- Big monolithic parent pipelines.
- A lot of time spent figuring out which pipelines are needed and where a newly developed pipeline fits in.

External Dependencies

There is a way around these issues. To be fair, it has some disadvantages too, but I think the advantages are much more appealing and outweigh them. Some tools already allow this, and there is room to improve on them. It is about so-called "external dependencies" and different dependency types.

For example, say we have three pipelines across two business domains:

- pip1_monthly_etl_dom_sales_orders_fact
- pip2_daily_etl_dom_sales_dim_customers
- pip3_daily_etl_dom_production_planning_fact

Let's say production_planning_fact needs sales order information and sales customer data, and that the data arrives at different points in time (a reality of our life). This is why we want to model a dependency from one task to another task that may reside elsewhere, instead of invoking pipelines from parent pipelines. So for pip3_daily_etl_dom_production_planning_fact we could say something like:

- Look up whether pip1_monthly_etl_dom_sales_orders_fact has executed successfully since the start of this month.
- Look up whether pip2_daily_etl_dom_sales_dim_customers has executed within a given timeframe (last 24 hours, same day as this task, etc.), and wait until it has completed.

If we fully embrace the idea of a layered design combined with business domains, we would need concepts like those in AUTOMIC or Apache Airflow. We could design one scheduler per layer-and-domain combination, where all of that combination's jobs reside. Objects in other layers or domains could then declare a dependency on them without us even knowing. Lineage would also become more telling, so we could see which data-processing tasks are not yet done, which domains need to be informed about incomplete data, and so on.
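To make that concrete, here is roughly how Apache Airflow expresses this kind of cross-pipeline dependency with its ExternalTaskSensor. This is only a minimal sketch: it assumes a recent Airflow 2.x, that the three pipelines are wrapped as DAGs with the same names, that pip2 is scheduled two hours before pip3, and that the monthly DAG's logical date falls on the first of the month (all assumptions on my side, not requirements).

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="pip3_daily_etl_dom_production_planning_fact",
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",  # daily at 06:00
    catchup=False,
):
    # "Soft" monthly dependency: check that the sales orders fact already has
    # a successful run for the current month instead of re-running it.
    wait_for_sales_orders = ExternalTaskSensor(
        task_id="wait_for_pip1_monthly_sales_orders_fact",
        external_dag_id="pip1_monthly_etl_dom_sales_orders_fact",
        # Map this run's logical date to the monthly run's logical date
        # (assumes the monthly DAG is scheduled on the 1st of the month).
        execution_date_fn=lambda logical_date: logical_date.replace(
            day=1, hour=0, minute=0, second=0, microsecond=0
        ),
        allowed_states=["success"],
        mode="reschedule",  # release the worker slot while waiting
        timeout=6 * 60 * 60,
    )

    # "Hard" daily dependency: wait until today's customer dimension load
    # has completed successfully.
    wait_for_customers = ExternalTaskSensor(
        task_id="wait_for_pip2_daily_dim_customers",
        external_dag_id="pip2_daily_etl_dom_sales_dim_customers",
        execution_delta=timedelta(hours=2),  # assumes pip2 runs 2 h earlier
        allowed_states=["success"],
        mode="reschedule",
        timeout=6 * 60 * 60,
    )

    # Placeholder for the actual production planning load.
    load_fact = EmptyOperator(task_id="load_production_planning_fact")

    [wait_for_sales_orders, wait_for_customers] >> load_fact
```

The point is not the specific tool but the modeling style: pip3 declares what it depends on and how long it is willing to wait, and no parent pipeline needs to know about it.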
Idea 1: Scheduler and Environment

- A scheduler is a job that runs permanently and invokes multiple pipelines at different timestamps, with the possibility to add maintenance plans.
- Environments, as known from SSIS, would be a great way to define sets of parameters used when triggering pipelines.
- Prioritizing pipelines or schedulers would help ensure the most important jobs run first (just a priority field on a scheduler or a pipeline).
- Connections should be parameterizable objects (not directly tied to this :D).
- A new object, instead of invoking a pipeline: an external dependency or pipeline dependency with different wait types and options (run on complete, don't run on error, etc.).
- The current logging of a pipeline's last state could be used to drive dependency types (hard, soft, etc.); a hand-rolled version of such a lookup is sketched at the end of this post.

What's Needed

What would be needed for this is a bit of a redesign of how executions are triggered and scheduled within Data Factory.

Cons of External Dependencies

Regarding the cons:

- We would need to think much more modularly than we do today.
- Unless a tool can detect dependencies inside a stored procedure (scanning for FROMs and JOINs is hard, especially with CTEs and nested joins), developers have to model the dependencies themselves, which increases the risk of errors.
- Fragmented jobs.
- Deployments become somewhat harder.
- Resource-intensive waits and many jobs running in parallel could consume "too many" CUs.

Conclusion

Sorry for making this a bit longer, but I wanted to get my point across: ever since the days of SQL Server Agent jobs, the missing piece for me has often been a more flexible way to wait for other jobs, instead of having to think in sequentially processed items.
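For completeness, here is the hand-rolled run-history lookup mentioned above, as a minimal sketch only: it uses the azure-mgmt-datafactory Python SDK against a classic Data Factory (Fabric pipelines expose a different API, so treat this as an illustration of the pattern, not a drop-in solution), and the subscription, resource group, and factory names are placeholders.

```python
import time
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters, RunQueryFilter

# Placeholders: fill in your own Azure resources.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)


def wait_for_pipeline(pipeline_name: str,
                      lookback: timedelta,
                      poll_seconds: int = 60,
                      timeout: timedelta = timedelta(hours=6)) -> bool:
    """Block until `pipeline_name` has a successful run within `lookback`.

    Returns True as soon as a successful run is found, False if `timeout`
    expires first. A "soft" dependency would simply check once and not wait.
    """
    deadline = datetime.now(timezone.utc) + timeout
    while datetime.now(timezone.utc) < deadline:
        now = datetime.now(timezone.utc)
        runs = client.pipeline_runs.query_by_factory(
            RESOURCE_GROUP,
            FACTORY_NAME,
            RunFilterParameters(
                last_updated_after=now - lookback,
                last_updated_before=now,
                filters=[RunQueryFilter(operand="PipelineName",
                                        operator="Equals",
                                        values=[pipeline_name])],
            ),
        )
        if any(run.status == "Succeeded" for run in runs.value):
            return True
        time.sleep(poll_seconds)
    return False


# pip3 only starts once its external dependencies are satisfied
# (a real framework would use "start of this month" for pip1 rather than 31 days).
if (wait_for_pipeline("pip1_monthly_etl_dom_sales_orders_fact", timedelta(days=31))
        and wait_for_pipeline("pip2_daily_etl_dom_sales_dim_customers", timedelta(hours=24))):
    print("Dependencies met, trigger pip3_daily_etl_dom_production_planning_fact")
```

This is exactly the kind of boilerplate a first-class external dependency object in Fabric or Data Factory would make unnecessary.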