Regional Information Center for Science and Technology - Distributed life cycle scheduling for cascaded data processing

Abstract:

According to current market trends, there is a huge demand for processing data of large volume, velocity, and variety. Both real-time and static data processing need to comply with big data standards to have a promising future. For business solutions based on data processing, data correctness is important at each intermediate stage of processing. In real-life scenarios, however, various sources of data corruption can be encountered. In this paper we target a solution that handles data corruption, ensures fault tolerance to the maximum extent possible, and provides synchronization among processing entities in a distributed environment. We have prototyped our proposal on top of Trident, an open-source big data framework for processing real-time data. The proposal can be used efficiently for cascaded data processing use cases, in which the output of one processing chain is the input to another. Common use cases that depend on data correctness include data interpolation, statistics management, billing software, and data fraud analysis. The paper describes an architecture that enhances the data processing capability of the current Trident implementation in a distributed environment for cascaded or interdependent data processing use cases. Our solution drives life cycle initiation for the processing flow in a transactional manner, based on the status of the individual processing nodes. This approach provides global, chain-level initialization and avoids data loss or data duplication caused by external entities. We focus on corruption due to initialization latency, which can arise from external entity connection discrepancies, external protocol response dependencies, or third-party software initialization dependencies. The proposed solution can be useful for handling both real-time and static data while providing synchronization, fault tolerance, and reliability.
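The idea of transactional, chain-level life cycle initiation can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the Trident API. The names `ProcessingNode`, `ChainCoordinator`, `connect`, `try_initialize`, and `submit` are hypothetical, and the in-memory buffer is an assumption made for illustration. The sketch assumes that a chain starts only when every node reports that its external dependency is ready, and that input arriving earlier is held back and then released exactly once, in arrival order.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class NodeState(Enum):
    CREATED = "created"
    READY = "ready"
    RUNNING = "running"
    FAILED = "failed"


@dataclass
class ProcessingNode:
    """One stage of a cascaded chain (hypothetical, not a Trident class)."""
    name: str
    connect: Callable[[], bool]  # returns True once the external dependency is up
    state: NodeState = NodeState.CREATED

    def prepare(self) -> bool:
        # Phase 1: attempt to initialize the external dependency.
        self.state = NodeState.READY if self.connect() else NodeState.FAILED
        return self.state is NodeState.READY

    def start(self) -> None:
        self.state = NodeState.RUNNING

    def reset(self) -> None:
        self.state = NodeState.CREATED


@dataclass
class ChainCoordinator:
    """Starts the whole chain or none of it, based on per-node status."""
    nodes: List[ProcessingNode]
    pending: List[object] = field(default_factory=list)
    processed: List[object] = field(default_factory=list)
    started: bool = False

    def try_initialize(self) -> bool:
        if self.started:
            return True
        # Ask every node to prepare, even if an earlier one already failed.
        results = [node.prepare() for node in self.nodes]
        if not all(results):
            # Roll back: no node runs while any node is still initializing.
            for node in self.nodes:
                node.reset()
            return False
        # Phase 2: commit. All nodes start together.
        for node in self.nodes:
            node.start()
        self.started = True
        # Release buffered input exactly once, in arrival order.
        buffered, self.pending = self.pending, []
        for item in buffered:
            self._process(item)
        return True

    def submit(self, item: object) -> None:
        # Before the chain is up, hold the input instead of dropping it.
        if self.started:
            self._process(item)
        else:
            self.pending.append(item)

    def _process(self, item: object) -> None:
        # Placeholder for passing the item through every node in the chain.
        self.processed.append(item)
```

In this sketch a caller would retry `try_initialize()` until it returns `True`. Items passed to `submit()` in the meantime are neither lost nor processed twice, which mirrors the goal stated in the abstract for initialization latency caused by external entities.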