
Reading Class

Amorphic Datalake stores data in many formats, backed either by a relational database (Redshift/Aurora, for structured data only) or by S3 object storage. The platform uses a consistent structure to store the data for better organization.

With the Read class you can orchestrate the reading of data in a more elegant way. You can use either Python or PySpark as the backend processing engine.
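For example, both backends expose a Read class with the same bucket-oriented interface (a minimal sketch; the bucket name and Spark context are placeholders, matching the examples below):

>>> from amorphicutils.python.read import Read as PythonRead
>>> from amorphicutils.pyspark.read import Read as SparkRead
>>> # python-shell backend: returns pandas DataFrames
>>> python_reader = PythonRead("dlz_bucket")
>>> # pyspark backend: returns Spark DataFrames
>>> spark_reader = SparkRead("dlz_bucket", spark_context)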

Reading in python-shell

The following class returns a pandas DataFrame of the data.

Reading from S3

class amorphicutils.python.read.Read(bucket_name, region=None, logger=None)

Class to read data from Amorphic

__init__(bucket_name, region=None, logger=None)

Initialize the class with dataset specific details

  • Parameters:
    • bucket_name – name of the bucket
>>> reader = Read("dlz_bucket")

list_object(domain_name, dataset_name)

List the objects for a specific dataset

  • Parameters:
    • domain_name – domain name of the dataset
    • dataset_name – dataset name
  • Returns: list of objects from S3
>>> reader = Read("dlz_bucket")
>>> reader.list_object("testdomain", "testdataset")

read_csv_data(domain_name, dataset_name, schema=None, header=False, delimiter=',', upload_date=None, path=None, **kwargs)

Read CSV data from S3 using the pandas read API and return a pandas DataFrame

  • Parameters:
    • domain_name – domain name of the dataset
    • dataset_name – dataset name
    • schema – list of column names of the data (list of str)
    • header – True if the data files contain a header. Default: False
    • delimiter – delimiter in the dataset. Default: ","
    • upload_date – upload date timestamp
    • path – path of the file to read from
    • kwargs – optional arguments supported by the pandas CSV reader
  • Returns: pandas DataFrame of the data
>>> reader = Read("dlz_bucket")
>>> df = reader.read_csv_data("testdomain", "testdataset", upload_date="1578305347")
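For files without a header row, a schema can be combined with pandas-level keyword arguments (a sketch; the column names, delimiter, and nrows value are illustrative):

>>> reader = Read("dlz_bucket")
>>> # supply column names explicitly and limit the read to the first 100 rows
>>> df = reader.read_csv_data("testdomain", "testdataset", schema=["id", "name"], header=False, delimiter="|", nrows=100)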

read_excel(domain_name, dataset_name, sheet_name=0, header=False, schema=None, upload_date=None, path=None, **kwargs)

Read data from Excel files and return a pandas DataFrame

  • Parameters:
    • domain_name – domain name of the dataset
    • dataset_name – dataset name
    • sheet_name – sheet name or index to read data from. Default: 0
    • header – True if the data files contain a header. Default: False
    • schema – list of column names of the data
    • upload_date – upload date timestamp
    • path – path of the file to read from
    • kwargs – optional arguments supported by the pandas Excel reader
  • Returns: pandas DataFrame of the data
>>> amorphic_reader = Read(bucket_name="dlz_bucket")
>>> result = amorphic_reader.read_excel(domain_name="testdomain", dataset_name="testdataset", header=True)
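A specific worksheet can also be selected by name or index (a sketch; the sheet name is illustrative):

>>> # read a named sheet instead of the default first sheet
>>> result = amorphic_reader.read_excel(domain_name="testdomain", dataset_name="testdataset", sheet_name="Sheet1", header=True)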

read_json(domain_name, dataset_name, upload_date=None, path=None, **kwargs)

Read data from JSON files and return a pandas DataFrame

  • Parameters:
    • domain_name – domain name of the dataset
    • dataset_name – dataset name
    • upload_date – upload date timestamp
    • path – path of the file to read from
    • kwargs – optional arguments supported by the pandas JSON reader
  • Returns: pandas DataFrame of the data
>>> amorphic_reader = Read(bucket_name="dlz_bucket")
>>> result = amorphic_reader.read_json(domain_name="testdomain", dataset_name="testdataset")
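Keyword arguments are forwarded to the pandas JSON reader, for example to read newline-delimited JSON (a sketch; lines=True assumes the dataset files are NDJSON):

>>> # pass pandas read_json options through kwargs
>>> result = amorphic_reader.read_json(domain_name="testdomain", dataset_name="testdataset", lines=True)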

Reading in pyspark

The following class returns a Spark DataFrame of the data.

Reading from S3

class amorphicutils.pyspark.read.Read(bucket_name, spark, region=None, logger=None)

Class to read data from Amorphic

__init__(bucket_name, spark, region=None, logger=None)

Initialize the class with dataset specific details

  • Parameters:
    • bucket_name – name of the bucket (dlz)
    • spark – SparkContext
>>> reader = Read("dlz_bucket", spark_context)

list_object(domain_name, dataset_name)

List the objects for a specific dataset

  • Parameters:
    • domain_name – domain name of the dataset
    • dataset_name – dataset name
  • Returns: list of objects from S3
>>> reader = Read("dlz_bucket", spark=spark_context)
>>> reader.list_object("testdomain", "testdataset")

Reading from Data Warehouse

class amorphicutils.pyspark.read.DwhRead(dwh_type, dwh_host, dwh_port, dwh_db, dwh_user, dwh_pass, tmp_dir)

Class to read data from the Data Warehouse (Redshift/Aurora)

__init__(dwh_type, dwh_host, dwh_port, dwh_db, dwh_user, dwh_pass, tmp_dir)

Initialize the class with the parameters required to connect to the data warehouse.

  • Parameters:
    • dwh_type – type of data warehouse: "redshift" or "aurora"
    • dwh_host – hostname of the DWH
    • dwh_port – port of the DWH
    • dwh_db – database name to connect to, e.g. cdap
    • dwh_user – username to use for the connection
    • dwh_pass – password for the user
    • tmp_dir – temporary directory to store intermediate results
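A construction sketch; the host, port, and temp directory values are placeholders and should be replaced with your cluster details:

>>> # connect to a Redshift-backed data warehouse (placeholder endpoint and S3 temp path)
>>> dwh_reader = DwhRead("redshift", "example-cluster.redshift.amazonaws.com", 5439, "cdap", dwh_user, dwh_pass, "s3://dlz_bucket/tmp/")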

read_from_redshift(glue_context, domain_name, dataset_name, **kwargs)

Return a response with the data from Redshift

  • Parameters:
    • glue_context – GlueContext
    • domain_name – domain name of the dataset
    • dataset_name – dataset name
    • kwargs – extra parameters, e.g. hashfield
  • Returns: response with the data from Redshift
>>> dwh_reader = DwhRead("redshift", DWH_HOST, DWH_PORT, DWH_DB, dwh_user, dwh_pass, tmp_dir)
>>> response = dwh_reader.read_from_redshift(glue_context, domain_name, dataset_name)
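Extra parameters such as hashfield are forwarded through kwargs (a sketch; the column name is illustrative):

>>> # pass hashfield through kwargs; "customer_id" is a hypothetical column
>>> response = dwh_reader.read_from_redshift(glue_context, domain_name, dataset_name, hashfield="customer_id")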