Spark SQL is the component of Apache Spark for working with structured, tabular data.
# Load data from a CSV file into a DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("trains.csv", header=True)
To interact with AWS from Python, use the Boto3 library.
1. AWS S3
S3 (Simple Storage Service) is AWS's object storage solution.
# Generate the boto3 client for...
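A minimal sketch of generating the Boto3 client for S3; the region is an assumption, and credentials are expected to come from the environment or the AWS config files:

```python
import boto3

# Create a low-level S3 client; credentials are read from the environment
# or from ~/.aws/credentials. The region here is an illustrative assumption.
s3 = boto3.client("s3", region_name="us-east-1")

# With valid credentials, the client can then list buckets, for example:
# response = s3.list_buckets()
# for bucket in response["Buckets"]:
#     print(bucket["Name"])
```

The actual API calls are left commented out because they require live AWS credentials.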
Spark is a tool for doing parallel computation with large datasets: it lets you spread data and computations over clusters with multiple nodes. PySpark is Spark's Python API.
Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows.
A DAG (Directed Acyclic Graph) is a workflow made up of tasks with dependencies between them.
Define a DAG in Python:
from airflow.models import DAG
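Building on the import above, a minimal DAG definition might look like the sketch below; the dag_id, start date, and schedule are illustrative placeholders, not values from the original notes:

```python
from datetime import datetime

from airflow.models import DAG

# Minimal DAG definition; id, start date, and schedule are placeholders
example_dag = DAG(
    dag_id="example_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
)
```

Tasks (operators) would then be attached to this DAG object to form the workflow.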
Download data using curl:
curl -O https://websitename.com/file001.txt
# -O -> save the file under its remote name
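The same download can be sketched with Python's standard library; the URL is the placeholder from above, and the helper mimics how curl -O derives the local filename from the last path segment:

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_from_url(url):
    """Mimic curl -O: use the last path segment as the local filename."""
    return os.path.basename(urlparse(url).path)


url = "https://websitename.com/file001.txt"
print(filename_from_url(url))  # file001.txt
# urlretrieve(url, filename_from_url(url))  # uncomment to actually download
```

The network call is left commented out so the snippet runs without hitting the placeholder URL.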
1. Importing Data from Flat Files and Spreadsheets
pd.read_csv handles all flat files (pd.read_excel handles spreadsheets):
import pandas as pd
data = pd.read_csv('file.csv')
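A self-contained version of the read_csv call above, using an in-memory file so it runs without file.csv on disk; the sample rows are made up for illustration:

```python
from io import StringIO

import pandas as pd

# Made-up sample data standing in for file.csv
csv_text = "name,speed\nTGV,320\nShinkansen,300\n"

data = pd.read_csv(StringIO(csv_text))
print(data.shape)          # (2, 2)
print(list(data.columns))  # ['name', 'speed']
```

read_csv also takes a sep argument for flat files delimited by something other than commas (e.g. sep="\t" for tab-separated files).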