
huhuhuhr Yesterday at 11:21 PM
A traditional ETL tool can usually tell me how many records were extracted, how many were cleaned and transformed, and how many were loaded into the database.
My task is scheduled every minute, and I want to count how many records it has extracted, cleaned, transformed, and finally loaded into the database within that minute. I have run into some problems:
1. I can't tell how many records are extracted, cleaned, transformed, and loaded in a single scheduled run. If the complete life cycle of the data is treated as a job, I can't tell whether the job has finished.
2. NiFi's provenance events are very powerful, but a FlowFile has events such as CREATE, FORK, SEND, and DROP, and stitching together the complete life cycle of a FlowFile means querying the NiFi API many times. Each extraction produces a FlowFile UUID, and in some scenarios the FlowFile's lineage graph is broken. For example, 100 records are extracted and then split into individual records, creating many new FlowFiles from the original one; the original FlowFile's lineage may be CREATE, FORK, DROP, while the FlowFiles produced by the FORK are CREATE, SEND, DROP. It is hard to display the complete journey of a single FlowFile.
My question is: is there a good way to use provenance data to calculate how many records are extracted and loaded in one scheduled run?
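One possible way to build around this with what NiFi exposes today is to query the provenance REST API for a fixed one-minute window and count the returned events by type (for example, SEND events as deliveries to the database load step and DROP events as the end of a FlowFile's life). The sketch below is only an illustration: it assumes an unsecured NiFi 1.x instance at http://localhost:8080/nifi-api, and the endpoint paths, timestamp format, and JSON field names are taken from the NiFi 1.x REST API documentation, so they should be verified against the version you run.

```python
# Sketch: count provenance events for one scheduled run by querying the
# NiFi provenance REST API over a one-minute window. Field names follow the
# NiFi 1.x REST API docs and may need adjusting for other versions; the base
# URL below is a placeholder for an unsecured local instance.
import time
from collections import Counter
from datetime import datetime, timedelta

import requests

NIFI_API = "http://localhost:8080/nifi-api"  # assumed; adjust host/port/auth as needed


def count_events(start: datetime, end: datetime, max_results: int = 10000) -> Counter:
    """Submit a provenance query for [start, end) and count events by type."""
    payload = {
        "provenance": {
            "request": {
                # NiFi expects "MM/dd/yyyy HH:mm:ss z"-style timestamps here.
                "startDate": start.strftime("%m/%d/%Y %H:%M:%S") + " UTC",
                "endDate": end.strftime("%m/%d/%Y %H:%M:%S") + " UTC",
                "maxResults": max_results,
            }
        }
    }
    # 1. Submit the query; NiFi runs it asynchronously and hands back a query id.
    query = requests.post(f"{NIFI_API}/provenance", json=payload).json()["provenance"]
    query_id = query["id"]
    try:
        # 2. Poll until the query reports that it has finished.
        while not query.get("finished", False):
            time.sleep(0.5)
            query = requests.get(f"{NIFI_API}/provenance/{query_id}").json()["provenance"]
        events = query["results"]["provenanceEvents"]
    finally:
        # 3. Delete the query so it does not linger on the server.
        requests.delete(f"{NIFI_API}/provenance/{query_id}")

    # DROP roughly marks the end of a FlowFile's life; SEND marks delivery to an
    # external system such as the database load step.
    return Counter(event["eventType"] for event in events)


if __name__ == "__main__":
    end = datetime.utcnow().replace(second=0, microsecond=0)
    start = end - timedelta(minutes=1)
    print(count_events(start, end))
```

In practice you would also narrow the query with search terms (for example, to the component IDs of the extract and load processors), but the exact shape of that field has changed between NiFi versions, so check the REST API documentation for your release.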
5 replies
Joe Witt 7 hours ago
Folks that come to NiFi from traditional ETL tools often look for this type of batch/job view on top of NiFi flows.
Joe Witt 7 hours ago
We don't really make this easy today, though. As you note, NiFi's provenance data certainly could/should have this, but extracting this sort of view over the data is actually pretty tough the way we make it available now.
Joe Witt 7 hours ago
We do need to offer the notion of a 'job' view whereby, from some initiation point, we can tell you definitively how many records came from it, where they went, etc.
Joe Witt 7 hours ago
For now you'd have to build around what we offer. But we need to make this better. It is a common and good ask.
huhuhuhr 11 minutes ago
Thank you for your answer. There is an open-source data lake project, Kylo, that uses NiFi as its underlying layer, and it seems to do event statistics quite well.
I have been using NiFi for three or four years and have written many custom NAR packages to extend it. If I get the chance, I will learn how to become a contributor and help make NiFi better.
I don't know whether the workspace admin is a direct designer of the project, but he revealed a few things:
1. NiFi does not have the batch-processing and job views of traditional ETL tools;
2. Extracting that kind of view from NiFi's data is actually difficult;
3. NiFi does need to provide the concept of a 'job' view, where from some initiation point it can tell you definitively how many records came from it, where they went, and so on. So if we keep submitting such feature requests through the official channels, a job view may well be released later.
Anyone who is interested, let's study Kylo together; some of Kylo's ideas for wrapping NiFi are really excellent.
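For the broken-lineage problem described in the original question, another workaround that stays within what NiFi already offers is to stamp every FlowFile of a scheduled run with a "batch id" attribute right after the extraction step; split and fork children typically inherit their parent's attributes, so the tag survives even where the event-level lineage is hard to stitch together. If that attribute is also indexed by the provenance repository, a single provenance search per run can return all events belonging to that run. The snippet below is only a sketch: the attribute name etl.batch.id is an arbitrary choice, UpdateAttribute and the ${now():format(...)} expression are standard NiFi features, and the nifi.provenance.repository.indexed.attributes property should be checked against your nifi.properties.

```
# nifi.properties: make the custom attribute searchable in provenance queries
nifi.provenance.repository.indexed.attributes=etl.batch.id

# UpdateAttribute processor (placed right after the extraction processor):
# add a dynamic property so each FlowFile carries the minute it was extracted in.
etl.batch.id = ${now():format('yyyyMMddHHmm')}
```

With that in place, the per-minute counts become one provenance search filtered on etl.batch.id instead of a chain of lineage lookups per FlowFile.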
