End-to-End-Data-Engineering

Tech Stack: Airflow, Cassandra DB, Kafka, Data Lake, MongoDB, Python, Redshift, JavaScript, AWS (S3, EC2, SQS)
GitHub URL: Project Link

Data engineering covers the work of collecting and storing data from different sources, then processing it into clean data that can be used in downstream operations such as data visualization, business analytics, and data science solutions. Data science becomes more productive when data engineering is in place; without it, preparing data for analysis takes longer and complicated business problems are slower to address. Data engineering therefore demands a thorough understanding of technology and tools, along with fast, reliable processing of complex datasets.

The aim of data engineering is to provide a controlled, consistent data flow that enables data-driven work such as ML models and data analysis. This data flow can pass through several groups and organizations. To achieve it, we use a technique known as a data pipeline: a system of independent programs that each perform an operation on stored data. Designing, building, maintaining, and extending data pipelines falls under the purview of data engineering. Many data engineering teams also build data platforms, because for many businesses a single pipeline that stores data in a SQL database is not enough.
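
As a rough illustration of the idea of a pipeline as a chain of independent programs, here is a minimal sketch in plain Python. It is not code from this repository: the function names (extract, clean, load, run_pipeline), the sample records, and the in-memory list standing in for a warehouse are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Stage = Callable[[Iterable[Record]], Iterable[Record]]


def extract() -> list[Record]:
    # Hypothetical raw records, standing in for scraped or ingested data.
    return [
        {"name": " Alice ", "amount": "10.5"},
        {"name": "Bob", "amount": "n/a"},
        {"name": "carol", "amount": "3"},
    ]


def clean(records: Iterable[Record]) -> Iterator[Record]:
    # Transform step: normalize names and drop rows whose amount is not numeric.
    for record in records:
        try:
            amount = float(record["amount"])
        except ValueError:
            continue
        yield {"name": record["name"].strip().title(), "amount": amount}


def load(records: Iterable[Record], sink: list[Record]) -> int:
    # Load step: append to an in-memory list that stands in for a real store.
    count = 0
    for record in records:
        sink.append(record)
        count += 1
    return count


def run_pipeline(source: Callable[[], Iterable[Record]],
                 stages: list[Stage],
                 sink: list[Record]) -> int:
    # Each stage only sees the output of the previous one, so stages stay independent.
    data = source()
    for stage in stages:
        data = stage(data)
    return load(data, sink)


warehouse: list[Record] = []
loaded = run_pipeline(extract, [clean], warehouse)
print(loaded, warehouse)
```

In a real setup each stage would typically be a separate task scheduled by an orchestrator such as Airflow, and the sink would be a database or data lake rather than a list; the sketch only shows how the stages hand data to one another.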

As a machine learning engineer, I find data engineering a gem of a technology to understand, and I expect it to soon become one of the most essential. Automating the data pipeline is crucial for production-level operations. To maintain the ETL process, I'm setting up a repository where I'll keep updating an end-to-end automated data pipeline built with some incredible tools: web scraping with Scrapy and Selenium, and storage in Postgres, Apache Cassandra, and MongoDB. It also uses AWS Redshift for data warehousing, AWS S3 and Spark for data lakes, and the Docker engine for running the pipelines. Follow this repository to learn more.