Data engineer • T-SQL developer • R and Python • Power BI and Tableau
Machine learning enthusiast • data lover
Spark SQL is the Apache Spark module for working with structured (tabular) data.
# Load data from file into a DataFrame
df = spark.read.csv("trains.csv", header=True)
# Register the DataFrame as a temporary view called table1
df.createOrReplaceTempView("table1")
# Inspect the columns in the table
print(df.columns)
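Once a temporary view like table1 is registered, you query it with plain SQL via spark.sql(...). The same register-then-query pattern can be sketched with the standard-library sqlite3 module (used here only as a stand-in, since a Spark cluster may not be at hand; the table name and columns are made up for illustration):

```python
import sqlite3

# Stand-in for the Spark SQL pattern: create a table of rows, then
# query it with ordinary SQL. In Spark you would instead call
# df.createOrReplaceTempView("table1") and then spark.sql(...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (train TEXT, delay INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)",
                 [("IC 501", 3), ("RE 12", 0)])
rows = conn.execute("SELECT train FROM table1 WHERE delay > 0").fetchall()
# rows == [("IC 501",)]
```

In Spark the equivalent query would be `spark.sql("SELECT train FROM table1 WHERE delay > 0")`, which returns a new DataFrame rather than a list of tuples.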
Scala has two kinds of variables: val (immutable, can't be reassigned once initialized) and var (mutable). Use // for comments.
val two: Int = 2
var a: String = "massy"
// you can also use type inference
val one = 1
// use println to print to the console
To interact with AWS from Python, use the Boto3 library.
1. AWS S3
S3 (Simple Storage Service) is AWS's object storage solution.
import boto3
# Generate the boto3 client for interacting with S3
s3 = boto3.client('s3',
                  region_name='eu-south-1',  # region where your resources live
                  aws_access_key_id=AWS_KEY_ID,  # placeholder credential variables,
                  aws_secret_access_key=AWS_SECRET)  # loaded from your config
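With a client in hand, a common task is uploading a file with the client's upload_file method. A minimal sketch, where the function name, bucket, and paths are illustrative (not from the original notes), and the client is passed in so the logic is testable without AWS credentials:

```python
def upload_report(s3_client, bucket, local_path, key):
    """Upload a local file to S3 and return its s3:// URI.

    s3_client is a boto3 S3 client (or anything with the same
    upload_file(Filename=..., Bucket=..., Key=...) interface).
    All names here are placeholders for illustration.
    """
    s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    return f"s3://{bucket}/{key}"
```

Usage would look like `upload_report(s3, "my-bucket", "report.csv", "daily/report.csv")`; passing the client as a parameter rather than creating it inside the function also makes the code easy to unit-test with a fake client.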
Spark = tool for doing parallel computation with large datasets: it lets you spread data and computations over clusters with multiple nodes. pyspark = the Python package that integrates Spark with Python.
The SparkSession.builder.getOrCreate() method returns an existing SparkSession if one is already active, and creates a new one otherwise.
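The get-or-create behaviour is just a singleton pattern. A toy pure-Python sketch of the idea (this is not the pyspark implementation, only an illustration of what getOrCreate does):

```python
class Session:
    """Toy stand-in for SparkSession, to illustrate getOrCreate."""
    _active = None  # the currently active session, if any

    class Builder:
        def getOrCreate(self):
            # Return the already-active session if there is one,
            # otherwise create one and remember it.
            if Session._active is None:
                Session._active = Session()
            return Session._active

    builder = Builder()
```

Calling `Session.builder.getOrCreate()` twice returns the same object, which is why in notebooks you can call the real `SparkSession.builder.getOrCreate()` repeatedly without spinning up extra sessions.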
Airflow: a platform to programmatically author, schedule, and monitor workflows.
DAG (Directed Acyclic Graph): a workflow made up of tasks with dependencies between them.
Define a DAG in Python:
from airflow.models import DAG
from datetime import datetime

# Create a default arguments dictionary (optional); values are placeholders
default_arguments = {'owner': 'me', 'start_date': datetime(2023, 1, 1)}
etl_dag = DAG('etl_workflow', default_args=default_arguments)
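A DAG is just tasks plus "runs before" edges, which Airflow resolves into an execution order. A minimal pure-Python sketch of that resolution (not the Airflow API; task names are made up):

```python
def topological_order(deps):
    """Return an execution order for tasks.

    deps maps each task name to the set of tasks it depends on.
    Raises ValueError if the dependencies contain a cycle
    (which is exactly what the "acyclic" in DAG rules out).
    """
    order, done = [], set()
    while len(done) < len(deps):
        # Tasks whose dependencies have all completed are ready to run.
        ready = [t for t in deps if t not in done and deps[t] <= done]
        if not ready:
            raise ValueError("cycle detected in task dependencies")
        for t in sorted(ready):
            order.append(t)
            done.add(t)
    return order

# A tiny ETL-style dependency graph: extract -> transform -> load
schedule = topological_order({
    'extract': set(),
    'transform': {'extract'},
    'load': {'transform'},
})
# schedule == ['extract', 'transform', 'load']
```

In real Airflow you declare the same edges with operators and `>>` (e.g. `extract >> transform >> load`), and the scheduler handles the ordering.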
In the Azure Portal, select your Storage Account, go to "Tables", and write down the table's URL.
Then, on the same page, go to "Access keys", click "Show keys", and copy the key.
Open Power BI Desktop, click "Get data" -> "More..."