Note 1: On August 12, 2015, I published a follow-up to this post, which is called How to Implement a General Solution for Federated API/CLI Access Using SAML 2.0. Be sure to see that post if you want to implement a general federation solution (not specific to AD FS).
Note 2: This post focuses on NTLM authentication, the default authentication mechanism for AD FS 2.0. If you are using AD FS 3.0—which uses form-based authentication by default—see How to Implement a General Solution for Federated API/CLI Access Using SAML 2.0.
AWS supports identity federation using SAML (Security Assertion Markup Language) 2.0. Using SAML, you can configure your AWS accounts to integrate with your identity provider (IdP). Once configured, your federated users are authenticated and authorized by your organization's IdP, and then can use single sign-on (SSO) to sign in to the AWS Management Console. This not only obviates the need for your users to remember yet another user name and password, but it also streamlines identity management for your administrators. This is great if your federated users want to access the AWS Management Console, but what if they want to use the AWS CLI or programmatically call AWS APIs?
In this blog post, I will show you how you can implement federated API and CLI access for your users. The examples provided use the AWS Python SDK and some additional client-side integration code. If you have federated users that require this type of access, implementing this solution should earn you more than one high five on your next trip to the water cooler.
Let's start by taking a quick look at where we'll end up.
janedoe@Ubuntu64:/tmp$ ./samlapi.py
Username: AD\janedoe
Password: ****************

Please choose the role you would like to assume:
[ 0 ]: arn:aws:iam::012345678987:role/ADFS-Administrators
[ 1 ]: arn:aws:iam::012345678987:role/ADFS-Operators
Selection: 1

----------------------------------------------------------------
Your new access key pair has been stored in the aws configuration file /home/janedoe/.aws/credentials under the saml profile.
Note that it will expire at 2015-05-26T17:16:20Z.
After this time you may safely rerun this script to refresh your access key pair.
To use this credential call the aws cli with the --profile option (e.g. aws --profile saml ec2 describe-instances).
----------------------------------------------------------------

Simple API example listing all s3 buckets:
[<Bucket: mybucket1>, <Bucket: mybucket2>, <Bucket: mybucket3>, <Bucket: mybucket4>, <Bucket: mybucket5>]
What are we seeing in the preceding output, exactly? The script prompts for the user's Active Directory credentials, presents the AWS IAM roles the user is authorized to assume, stores the resulting temporary security credentials under a saml profile in the AWS credentials file, and finishes with a sample API call that lists S3 buckets.
After taking this sample and customizing it for your organization, you'll be able to leverage the power and automation of the AWS API and CLI interfaces by using your organization's credentials, while maintaining the control provided by the AWS Identity and Access Management (IAM) service.
This post will focus on using Microsoft Active Directory Federation Services (AD FS), but if you are using a different provider, don't despair because the basic pattern should work with other common IdPs.
To follow along with this post, you must already have SAML federation configured between your AD FS instance and your AWS account, a Python 2.7 environment with the boto SDK installed, and a base AWS CLI/SDK configuration whose credentials file looks similar to the following:
[default]
output = json
region = us-west-2
aws_access_key_id =
aws_secret_access_key =
Important note #1: An AWS access key pair is not configured in the above structure because the initial AWS STS call is authenticated by the SAML assertion returned by the trusted IdP. All subsequent API/CLI calls are authenticated by the key pair contained within the returned AWS STS token. For additional details, see Giving AWS Console Access to Federated Users Using SAML.
Important note #2: In the past, the various AWS SDK and CLI packages supported both config and credential files. These files have similar but slightly different syntaxes. Credential files are now standard and will be used in this post. If you are still using a config file, you'll need to convert to a credential file in order to avoid errors later in the process.
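If you are not sure which type of file you have, the section headers are the quickest tell. The following sketch illustrates the difference, using the saml profile that this post's script creates later; it is not an exhaustive comparison of the two formats.

# ~/.aws/config (the older "config file" style): named profiles use a "profile" prefix
[profile saml]
region = us-west-2
output = json

# ~/.aws/credentials (the format used in this post): the bare profile name
[saml]
region = us-west-2
output = json
aws_access_key_id = <temporary access key>
aws_secret_access_key = <temporary secret key>
aws_session_token = <temporary session token>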
First, you need to install two of the modules that fall outside the core Python distribution, specifically beautifulsoup4 and requests-ntlm. There are several ways to do this, but the pip utility, which is included in Python 2.7.9+, makes it a very easy task. You simply need to run the following two commands:
pip install beautifulsoup4
and then
pip install requests-ntlm
You should see output confirming that both packages installed successfully.
The AWS federation process utilizes what is known as an IdP-initiated login. In the last preparatory step, you'll need to determine the specific URL that is being used to initiate this login. Start with your basic IdP-initiated login URL (the one you use for federated access to your various SAML Relying Parties, including the AWS Management Console). In this example, I'm using AD FS 2.0, where the URL takes the form of https://<fqdn>/adfs/ls/IdpInitiatedSignOn.aspx. If I place this URL in a browser's address field, I'm presented with a site selection page, as shown in the following image.
To construct the URL you need, concatenate the original IdP-initiated login URL with the query string ?loginToRp=urn:amazon:webservices. The resulting URL takes the form https://<fqdn>/adfs/ls/IdpInitiatedSignOn.aspx?loginToRp=urn:amazon:webservices. Save this URL; it is referred to as the idpentryurl variable below. If you place this concatenated URL in a browser's address field, you should still reach the AD FS IdP-initiated login process, but you bypass the site selection page and proceed straight to the AWS Management Console after you authenticate. This is the URL you need for this exercise. If you are interested, this query string is documented on Microsoft's website.
Important note: Be sure to pay close attention to the capitalization of the hostname in the URL. AD FS uses this as part of the authentication process, and any mismatch in capitalization will cause the authentication to fail.
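If you would rather assemble this URL in code, the following minimal Python sketch shows the concatenation; the fqdn value here is a hypothetical AD FS hostname and must be replaced with your own (preserving its exact capitalization, per the note above).

# Illustrative only: build the IdP-initiated login URL for the AWS relying party
fqdn = 'adfs.example.com'  # hypothetical AD FS hostname; replace with your own
idpentryurl = ('https://' + fqdn +
               '/adfs/ls/IdpInitiatedSignOn.aspx?loginToRp=urn:amazon:webservices')
print idpentryurl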
Now that you have gone through the steps in the "Getting Started" section above, you are ready to start assembling the integration code that enables federated API and CLI access. I'll explain what's going on one chunk at a time so that you can follow along, and I have included a link to the full code example at the end of this post. First, let's set up some basic imports and variables.
#!/usr/bin/python

import sys
import boto.sts
import boto.s3
import requests
import getpass
import ConfigParser
import base64
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup
from os.path import expanduser
from urlparse import urlparse, urlunparse
from requests_ntlm import HttpNtlmAuth

##########################################################################
# Variables

# region: The default AWS region that this script will connect
# to for all API calls
region = 'us-west-2'

# output format: The AWS CLI output format that will be configured in the
# saml profile (affects subsequent CLI calls)
outputformat = 'json'

# awsconfigfile: The file where this script will store the temp
# credentials under the saml profile
awsconfigfile = '/.aws/credentials'

# SSL certificate verification: Whether or not strict certificate
# verification is done, False should only be used for dev/test
sslverification = True

# idpentryurl: The initial URL that starts the authentication process.
idpentryurl = 'https://<fqdn>/adfs/ls/IdpInitiatedSignOn.aspx?loginToRp=urn:amazon:webservices'

##########################################################################
Adjust the variables in the previous code to suit your region (for example, us-west-2 or us-east-1) and your preferred output format (json, text, or table), and then plug in the idpentryurl value you noted in the previous section. If you're using a self-signed certificate in a trusted testing environment, you'll also need to set sslverification to False to avoid certificate verification failures.
With the basic environment set up, prompt the user and retrieve the user's credentials from standard input. I'll talk later in this post about how we can support other credential forms.
# Get the federated credentials from the user
print "Username:",
username = raw_input()
password = getpass.getpass()
print ''
The comma following the first print "Username:", statement might look like a bug, but it is actually a little Python trick to prevent the print statement from appending a newline character. You'll also note the use of the getpass() method, which prevents the password from being echoed to the terminal.
Next, we use the Python requests module to assemble the authentication information and formulate the HTTPS request to the IdP, as shown in the following snippet. Assuming authentication is successful, the response returned by AD FS will contain the SAML assertion.
# Initiate session handler
session = requests.Session()

# Programmatically get the SAML assertion
# Set up the NTLM authentication handler by using the provided credentials
session.auth = HttpNtlmAuth(username, password, session)

# Opens the initial AD FS URL and follows all of the HTTP302 redirects
response = session.get(idpentryurl, verify=sslverification)

# Debug the response if needed
#print (response.text)
At this point we have the IdP response safely in hand, so let's overwrite and dereference the variables we used to store the user's username and password because they are no longer needed. Python, as a higher-level language, doesn't provide the direct memory manipulation to do this like one might in C, but this does a good job of ensuring the credentials don't linger around.
# Overwrite and delete the credential variables, just for safety
username = '##############################################'
password = '##############################################'
del username
del password
As I noted previously, the SAML assertion that we need can be found in the response from the IdP. The following code uses the BeautifulSoup module to extract it.
# Decode the response and extract the SAML assertion
soup = BeautifulSoup(response.text.decode('utf8'))
assertion = ''

# Look for the SAMLResponse attribute of the input tag (determined by
# analyzing the debug print lines above)
for inputtag in soup.find_all('input'):
    if(inputtag.get('name') == 'SAMLResponse'):
        #print(inputtag.get('value'))
        assertion = inputtag.get('value')
This snippet simply iterates through all of the HTML input tags until it finds the one named SAMLResponse. The Base64-encoded assertion we're after is contained in the value attribute of this tag. Before moving on, we need to go one layer deeper into the SAML authentication response itself and extract the list of the user's authorized roles. This list is stored in the SAML attribute named https://aws.amazon.com/SAML/Attributes/Role, as shown in the following code.
# Parse the returned assertion and extract the authorized roles
awsroles = []
root = ET.fromstring(base64.b64decode(assertion))

for saml2attribute in root.iter('{urn:oasis:names:tc:SAML:2.0:assertion}Attribute'):
    if (saml2attribute.get('Name') == 'https://aws.amazon.com/SAML/Attributes/Role'):
        for saml2attributevalue in saml2attribute.iter('{urn:oasis:names:tc:SAML:2.0:assertion}AttributeValue'):
            awsroles.append(saml2attributevalue.text)

# Note the format of the attribute value should be role_arn,principal_arn
# but lots of blogs list it as principal_arn,role_arn so let's reverse
# them if needed
for awsrole in awsroles:
    chunks = awsrole.split(',')
    if 'saml-provider' in chunks[0]:
        newawsrole = chunks[1] + ',' + chunks[0]
        index = awsroles.index(awsrole)
        awsroles.insert(index, newawsrole)
        awsroles.remove(awsrole)
In the first loop above, I'm using the Python xml.etree.ElementTree module to parse the SAML assertion. Though assertions aren't particularly pretty to look at, the specific namespaces and names make finding what you're looking for relatively straightforward.
In the second loop, the code is just doing a bit of proactive error proofing to make sure the Amazon Resource Names (ARNs) contained within the roles are in the right order.
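To make the expected format concrete, a single AttributeValue in the assertion is a comma-separated role/provider pair. Using the account from the sample output above (the role and the ADFS provider name are illustrative), one value might look like this:

arn:aws:iam::012345678987:role/ADFS-Operators,arn:aws:iam::012345678987:saml-provider/ADFS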
Now we know the roles the user is authorized to assume. Next, the code asks the user which role they want to assume.
# If I have more than one role, ask the user which one they want,
# otherwise just proceed
print ""
if len(awsroles) > 1:
    i = 0
    print "Please choose the role you would like to assume:"
    for awsrole in awsroles:
        print '[', i, ']: ', awsrole.split(',')[0]
        i += 1

    print "Selection: ",
    selectedroleindex = raw_input()

    # Basic sanity check of input
    if int(selectedroleindex) > (len(awsroles) - 1):
        print 'You selected an invalid role index, please try again'
        sys.exit(0)

    role_arn = awsroles[int(selectedroleindex)].split(',')[0]
    principal_arn = awsroles[int(selectedroleindex)].split(',')[1]
else:
    role_arn = awsroles[0].split(',')[0]
    principal_arn = awsroles[0].split(',')[1]
If the user only has one role, it is automatically used. Otherwise, the user will be prompted to select which role they want to use from a simple enumerated list.
Things now start to get quite a bit easier. Because we've already retrieved the SAML assertion and extracted the various ARNs, we just need to call the standard AWS STS service to request AWS temporary security credentials. This is really the crux of the entire utility: we've undertaken all of the effort up to this point to prepare the parameters for the following simple AssumeRoleWithSAML API call.
# Use the assertion to get an AWS STS token using Assume Role with SAML
conn = boto.sts.connect_to_region(region)
token = conn.assume_role_with_saml(role_arn, principal_arn, assertion)
Important note: The SAML assertion, which comes from a configured and trusted IdP, serves as the authentication credential for this API call.
Now that we have our credentials from AWS STS, the next step is to insert these credentials into a specific profile within the AWS credentials file as shown in the following code snippet. By doing so, you'll be able to use the temporary security credentials for any number of API or CLI calls until the credentials expire.
# Write the AWS STS token into the AWS credential file
home = expanduser("~")
filename = home + awsconfigfile

# Read in the existing config file
config = ConfigParser.RawConfigParser()
config.read(filename)

# Put the credentials into a specific profile instead of clobbering
# the default credentials
if not config.has_section('saml'):
    config.add_section('saml')

config.set('saml', 'output', outputformat)
config.set('saml', 'region', region)
config.set('saml', 'aws_access_key_id', token.credentials.access_key)
config.set('saml', 'aws_secret_access_key', token.credentials.secret_key)
config.set('saml', 'aws_session_token', token.credentials.session_token)

# Write the updated config file
with open(filename, 'w+') as configfile:
    config.write(configfile)
In this code snippet, I'm simply using a combination of variables to locate the AWS credentials file, reading in the file, adding or updating a profile called 'saml', and writing the file back out.
Finally, the code gives the federated user a basic idea of what has just happened.
# Give the user some basic info as to what has just happened
print '\n\n----------------------------------------------------------------'
print 'Your new access key pair has been stored in the AWS configuration file {0} under the saml profile.'.format(filename)
print 'Note that it will expire at {0}.'.format(token.credentials.expiration)
print 'After this time you may safely rerun this script to refresh your access key pair.'
print 'To use this credential call the AWS CLI with the --profile option (e.g. aws --profile saml ec2 describe-instances).'
print '----------------------------------------------------------------\n\n'
If your automation is CLI based, you can skip ahead to the next section, "Putting it all together." Alternatively, if your automation is API based, you can simply use the temporary security credentials when making your connections.
# Use the AWS STS token to list all of the S3 buckets
s3conn = boto.s3.connect_to_region(region,
                     aws_access_key_id=token.credentials.access_key,
                     aws_secret_access_key=token.credentials.secret_key,
                     security_token=token.credentials.session_token)

buckets = s3conn.get_all_buckets()
print 'Simple API example listing all s3 buckets:'
print(buckets)
In practice, it would be far better form to place these additional API calls in a separate script that utilizes the credential profile we set up above, but the point of the exercise is that the credentials received from AWS STS are being used to authenticate the API call.
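For instance, a separate script could read the temporary credentials back out of the saml profile rather than holding the STS token in memory. The following is a minimal sketch along those lines; it reuses only the modules and calls already shown above, and assumes the federation script has already populated the saml profile.

#!/usr/bin/python
# Hypothetical follow-on script: reuse the temporary credentials stored
# under the saml profile by the federation script above.
import boto.s3
import ConfigParser
from os.path import expanduser

region = 'us-west-2'

# Read the saml profile from the shared credentials file
config = ConfigParser.RawConfigParser()
config.read(expanduser("~") + '/.aws/credentials')

# Connect to S3 with the stored temporary credentials
s3conn = boto.s3.connect_to_region(region,
                     aws_access_key_id=config.get('saml', 'aws_access_key_id'),
                     aws_secret_access_key=config.get('saml', 'aws_secret_access_key'),
                     security_token=config.get('saml', 'aws_session_token'))

print 'Buckets visible to the saml profile:'
print s3conn.get_all_buckets()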
When you put all of this code together, you see what's shown in the following output.
janedoe@Ubuntu64:/tmp$ ./samlapi.py
Username: AD\janedoe
Password: ****************

Please choose the role you would like to assume:
[ 0 ]: arn:aws:iam::012345678987:role/ADFS-Administrators
[ 1 ]: arn:aws:iam::012345678987:role/ADFS-Operators
Selection: 1

----------------------------------------------------------------
Your new access key pair has been stored in the aws configuration file /home/janedoe/.aws/credentials under the saml profile.
Note that it will expire at 2015-05-26T17:16:20Z.
After this time you may safely rerun this script to refresh your access key pair.
To use this credential call the aws cli with the --profile option (e.g. aws --profile saml ec2 describe-instances).
----------------------------------------------------------------

Simple API example listing all s3 buckets:
[<Bucket: mybucket1>, <Bucket: mybucket2>, <Bucket: mybucket3>, <Bucket: mybucket4>, <Bucket: mybucket5>]
What are we seeing in the preceding output, exactly? The same flow described at the start of this post: authentication against AD FS, role selection, storage of the temporary security credentials under the saml profile, and a simple S3 API call.
You can then issue subsequent API or CLI calls by simply referencing the saml profile, as shown in the following output.
janedoe@Ubuntu64:/tmp$ aws --profile saml --output text s3api list-buckets
BUCKETS 2014-09-09T17:11:48.000Z mybucket1
BUCKETS 2014-11-03T18:27:35.000Z mybucket2
BUCKETS 2014-08-13T02:45:28.000Z mybucket3
BUCKETS 2015-01-14T17:53:47.000Z mybucket4
BUCKETS 2015-05-19T19:23:25.000Z mybucket5
OWNER myowner 128abcedf111aa33cc44ddee5512345abcd6fff4eed
In the end, the browser-oriented nature of SAML requires a bit of code and a bit of engineering, but it accomplishes the goal: enabling federation-based authentication for API and CLI calls!
If you are using an IdP other than AD FS, don't despair that your solution was not included in this blog post. The reason I chose the integration technique of using a Python module to emulate browser behavior is that it should be extensible to just about any IdP solution.
Similarly, if you are using other authentication mechanisms such as Kerberos, X.509 certificates, or Multi-Factor Authentication, there are additional handlers within the Python requests module. These additional handlers can be used to enhance the example above after you have the basics working.
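As a rough illustration of what such a swap might look like, the sketch below replaces the NTLM handler with the HTTPKerberosAuth handler from the companion requests-kerberos package, and shows where a client certificate could be attached to the session. The package name, its installation, and your IdP's support for these methods are all assumptions here; treat this as a starting point rather than a drop-in change.

# Sketch only: alternative authentication handlers for the requests session.
# Assumes 'pip install requests-kerberos' and an IdP configured for these methods.
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

session = requests.Session()

# Option 1: Kerberos/SPNEGO authentication instead of NTLM
session.auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)

# Option 2: X.509 client certificate authentication (uncomment to use)
# session.cert = ('/path/to/client.crt', '/path/to/client.key')

# idpentryurl and sslverification come from the variables section above
response = session.get(idpentryurl, verify=sslverification)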
If you want to get started quickly, see the full code example. I hope you enjoyed this post, and I look forward to your feedback.
– Quint
This lab provides a basic understanding of Amazon Redshift. It demonstrates the basic steps required to get started with Redshift, including:
· Creating a Redshift cluster
· Loading data into a Redshift cluster
· Performing queries against the data in the cluster
Topics covered
By the end of this lab you will be able to:
· Launch a Redshift cluster
· Connect an SQL client to the Amazon Redshift cluster
· Load data from an S3 bucket into the Amazon Redshift cluster
· Run queries against data stored in Amazon Redshift
Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools.
Amazon Simple Storage Service (Amazon S3) makes it simple and practical to collect, store, and analyze data - regardless of format - all at massive scale. S3 is object storage built to store and retrieve any amount of data from anywhere - web sites and mobile apps, corporate applications, and data from IoT sensors.
Prerequisites
Familiarity with relational databases and SQL concepts would be beneficial.
Task 1: Launch an Amazon Redshift Cluster
In this task, you will launch an Amazon Redshift cluster. A cluster is a fully managed data warehouse that consists of a set of compute nodes. Each cluster runs an Amazon Redshift engine and contains one or more databases.
When you launch a cluster, one of the options you specify is the node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node. Node types are available in different sizes. Node size and the number of nodes determine the total storage for a cluster.
You can also type in the search box to select the AWS service (e.g., Redshift) that you wish to use.
· Cluster identifier:
· Database name:
· Master user name:
· Master user password:
The NODE CONFIGURATION page is displayed. This is where you can select the Node type and the number of nodes in your cluster. For this lab, you will be using the default configuration of a single-node cluster.
If you change the cluster to a multi-node configuration, at least two compute nodes are required.
The ADDITIONAL CONFIGURATION page will be displayed. This is where you can configure cluster settings to control encryption, security and networking.
This page is also where you can choose a different encryption key if you enable encryption for the cluster.
This role grants permission for Amazon Redshift to read data from Amazon S3.
The REVIEW page displays information about the cluster that you are about to launch.
You may ignore the warning about the free trial.
The cluster will take a few minutes to launch. Please continue with the labs steps. There is no need to wait.
The cluster configuration will be displayed. Spend a few minutes looking at the properties, but do not click any links.
· Cluster Properties: Contains information about the Cluster including: Name, Type, Node Type, number of Nodes, Zone location, Time and version of the creation as well as other information.
· Cluster Status: Allows you to see the current status of the cluster whether it is available or not and also whether it is currently In Maintenance Mode.
· Cluster Database Properties: Contains information on the Endpoint, which is the DNS address of the cluster, and the port number on which the database accepts connections. These are required when you want to create SQL connections. It also lets you know whether the cluster has a public IP address that can be accessed from the public internet. The JDBC URL and ODBC URL contain the URLs used to connect to the cluster from a JDBC (Java Database Connectivity) or ODBC (Open Database Connectivity) client.
Endpoint lab.cd9hri01rnv5.us-west-2.redshift.amazonaws.com:5439 ( authorized )
· Backup, Audit Logging and Maintenance: Contains information on how many days the automated snapshots are retained, whether they are automatically copied to another region, and whether logging is enabled on the cluster.
· Capacity Details: Contains information about the data warehouse node type, number of EC2 Compute Units per node, memory, disk storage, I/O performance as well as the processor architecture of the node type.
· SSH Ingestion Settings: Contains information about the Public Key of the cluster as well as the Public and Private IP addresses of the node.
Task 2: Launch Pgweb to Communicate with your Redshift Cluster
Amazon Redshift can be used via industry-standard SQL. To use Redshift, you require an SQL Client that provides a user interface to type SQL. Any SQL client that supports JDBC or ODBC can be used with Redshift.
For this lab, you will use a web application called Pgweb, which provides a friendly SQL interface to Redshift. The application is written for PostgreSQL, which is why it is called Pgweb. The SQL used in Amazon Redshift is based upon PostgreSQL.
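Because Amazon Redshift speaks the PostgreSQL wire protocol (which is how Pgweb connects), you could also connect to the cluster programmatically. The sketch below uses the psycopg2 Python driver, which is not part of this lab; the host comes from the lab's example endpoint, and the database name, user, and password are placeholders you would replace with your own cluster's values.

# Illustrative only: connecting to the lab cluster with the psycopg2 driver (not part of this lab).
import psycopg2

conn = psycopg2.connect(
    host='lab.cd9hri01rnv5.us-west-2.redshift.amazonaws.com',  # your cluster Endpoint, without the port
    port=5439,                                                 # the Redshift port used in this lab
    dbname='lab',                                              # placeholder: your Database name
    user='master',                                             # placeholder: your Master user name
    password='YOUR-PASSWORD')                                  # placeholder: your Master user password

cur = conn.cursor()
cur.execute('SELECT COUNT(*) FROM users;')  # the table created in Task 3
print(cur.fetchone()[0])
conn.close()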
The blue button will also copy values shown in this side panel.
You will be presented with a login page for pgweb:
Endpoint lab.cd9hri01rnv5.us-west-2.redshift.amazonaws.com:5439 ( authorized )
Note: the default port is 5432; change it to 5439.
If the endpoint is not yet displayed, it is because your cluster is still being created. Wait a minute and click the refresh icon until the Endpoint appears.
· Host: Paste the Endpoint that you copied, but remove the :5439 port number from the end
· Username:
· Password:
· Database:
· Port: 5439 (note that this is different from the default of 5432)
If you receive a too many colons error, double-check your Host. You will need to remove the :5439 port number from the end of the Host name.
Pgweb will connect to your Amazon Redshift database. If you are not able to connect to the database, double-check the above parameters.
Task 3: Create a Table
In this task, you will execute SQL commands to create a table in Redshift.
CREATE TABLE users
(userid INTEGER NOT NULL,
username CHAR(8),
firstname VARCHAR(30),
lastname VARCHAR(30),
city VARCHAR(30),
state CHAR(2),
email VARCHAR(100),
phone CHAR(14),
likesports BOOLEAN,
liketheatre BOOLEAN,
likeconcerts BOOLEAN,
likejazz BOOLEAN,
likeclassical BOOLEAN,
likeopera BOOLEAN,
likerock BOOLEAN,
likevegas BOOLEAN,
likebroadway BOOLEAN,
likemusicals BOOLEAN
);
This command will create a table called users. It contains each user's name, address, and details about the types of entertainment they like.
In the left navigation pane, below Tables, you should now see a users table.
You may ignore the No records found message because no records were queried.
The structure of the table is displayed:
Task 4: Load Sample Data from Amazon S3
Amazon Redshift can import data from Amazon S3. Various file formats are supported, including fixed-width fields, comma-separated values (CSV), and custom delimiters. The data for this lab is pipe-delimited (|).
COPY users FROM 's3://awssampledbuswest2/tickit/allusers_pipe.txt'
CREDENTIALS 'aws_iam_role=YOUR-ROLE'
DELIMITER '|';
Before running this command, you will need to insert the ROLE that Redshift will use to access Amazon S3.
COPY users FROM 's3://awssampledbuswest2/tickit/allusers_pipe.txt'
CREDENTIALS 'aws_iam_role=arn:aws:iam::557653742807:role/qls-1848056-f22f1d2579629cc6-RedshiftRole-1UEBS06ETOXBE'
The second line should now look like: CREDENTIALS 'aws_iam_role=arn:aws:iam...'
The command will take approximately 10 seconds to load 49,990 rows of data.
You should now see the data that was loaded into Redshift.
Task 5: Query Data
Now that you have data in your Redshift database you can query the data using SQL select statements and queries. If you are familiar with SQL, feel free to try additional commands to query the data.
SELECT COUNT(*) FROM users;
This query returns the number of rows in the users table.
SELECT userid, firstname, lastname, city, state
FROM users
WHERE likesports AND NOT likeopera AND state = 'OH'
ORDER BY firstname;
This query displays users in Ohio (OH) who like sports but do not like opera. The list is sorted by their first name.
SELECT
city,
COUNT(*) as count
FROM users
WHERE likejazz
GROUP BY city
ORDER BY count DESC
LIMIT 10;
This query shows the Top 10 cities where Jazz-loving users live.
Try to write a query that only displays the firstname and lastname of users who like both Theatre and Classical music, where their last name is Smith.
select firstname, lastname
from users
where liketheatre and likeclassical and lastname='Smith';
Additional resources

· Getting started with Amazon Redshift Spectrum: https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
· Create snapshot
· SQL Workbench/J client: http://www.sql-workbench.net/manual/install.html
· Configure a JDBC connection: https://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html
· Configure an ODBC connection: https://docs.aws.amazon.com/redshift/latest/mgmt/configure-odbc-connection.html
· Amazon Redshift Forum: https://forums.aws.amazon.com/forum.jspa?forumID=155