Note 1: On August 12, 2015, I published a follow-up to this post, which is called How to Implement a General Solution for Federated API/CLI Access Using SAML 2.0. Be sure to see that post if you want to implement a general federation solution (not specific to AD FS).
Note 2: This post focuses on NTLM authentication, the default authentication mechanism for AD FS 2.0. If you are using AD FS 3.0—which uses form-based authentication by default—see How to Implement a General Solution for Federated API/CLI Access Using SAML 2.0.
AWS supports identity federation using SAML (Security Assertion Markup Language) 2.0. Using SAML, you can configure your AWS accounts to integrate with your identity provider (IdP). Once configured, your federated users are authenticated and authorized by your organization's IdP, and then can use single sign-on (SSO) to sign in to the AWS Management Console. This not only obviates the need for your users to remember yet another user name and password, but it also streamlines identity management for your administrators. This is great if your federated users want to access the AWS Management Console, but what if they want to use the AWS CLI or programmatically call AWS APIs?
In this blog post, I will show you how you can implement federated API and CLI access for your users. The examples provided use the AWS Python SDK and some additional client-side integration code. If you have federated users that require this type of access, implementing this solution should earn you more than one high five on your next trip to the water cooler.
Let's start by taking a quick look at where we'll end up.
janedoe@Ubuntu64:/tmp$ ./samlapi.py
Username: AD\janedoe
Password: ****************

Please choose the role you would like to assume:
[ 0 ]: arn:aws:iam::012345678987:role/ADFS-Administrators
[ 1 ]: arn:aws:iam::012345678987:role/ADFS-Operators
Selection: 1

----------------------------------------------------------------
Your new access key pair has been stored in the aws configuration file /home/janedoe/.aws/credentials under the saml profile.
Note that it will expire at 2015-05-26T17:16:20Z.
After this time you may safely rerun this script to refresh your access key pair.
To use this credential call the aws cli with the --profile option (e.g. aws --profile saml ec2 describe-instances).
----------------------------------------------------------------

Simple API example listing all s3 buckets:
[<Bucket: mybucket1>, <Bucket: mybucket2>, <Bucket: mybucket3>, <Bucket: mybucket4>, <Bucket: mybucket5>]
What are we seeing in the preceding output, exactly? The script prompts for the user's Active Directory credentials, presents the AWS IAM roles the user is authorized to assume, stores the resulting temporary security credentials under a saml profile in the AWS credentials file, and finishes with a sample API call that lists S3 buckets.
After taking this sample and customizing it for your organization, you'll be able to leverage the power and automation of the AWS API and CLI interfaces by using your organization's credentials, while maintaining the control provided by the AWS Identity and Access Management (IAM) service.
This post will focus on using Microsoft Active Directory Federation Services (AD FS), but if you are using a different provider, don't despair because the basic pattern should work with other common IdPs.
To follow along with this post, you must already have SAML federation configured between your AD FS instance and your AWS account, a Python 2.7 environment with the boto SDK installed, and a base AWS CLI/SDK configuration whose credentials file looks similar to the following:
[default]
output = json
region = us-west-2
aws_access_key_id =
aws_secret_access_key =
Important note #1: An AWS access key pair is not configured in the above structure because the initial AWS STS call is authenticated by the SAML assertion returned by the trusted IdP. All subsequent API/CLI calls are authenticated by the key pair contained within the returned AWS STS token. For additional details, see Giving AWS Console Access to Federated Users Using SAML.
Important note #2: In the past, the various AWS SDK and CLI packages supported both config and credential files. These files have similar but slightly different syntaxes. Credential files are now standard and will be used in this post. If you are still using a config file, you'll need to convert to a credential file in order to avoid errors later in the process.
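If you are not sure which type of file you have, the section headers are the quickest tell. The following sketch illustrates the difference, using the saml profile that this post's script creates later; it is not an exhaustive comparison of the two formats.

# ~/.aws/config (the older "config file" style): named profiles use a "profile" prefix
[profile saml]
region = us-west-2
output = json

# ~/.aws/credentials (the format used in this post): the bare profile name
[saml]
region = us-west-2
output = json
aws_access_key_id = <temporary access key>
aws_secret_access_key = <temporary secret key>
aws_session_token = <temporary session token>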
First, you need to install two of the modules that fall outside the core Python distribution, specifically beautifulsoup4 and requests-ntlm. There are several ways to do this, but the pip utility, which is included in Python 2.7.9+, makes it a very easy task. You simply need to run the following two commands:
pip install beautifulsoup4
and then
pip install requests-ntlm
You should see output confirming that both packages installed successfully.
The AWS federation process utilizes what is known as an IdP-initiated login. In the last preparatory step, you'll need to determine the specific URL that is being used to initiate this login. Start with your basic IdP-initiated login URL (the one you use for federated access to your various SAML Relying Parties, including the AWS Management Console). In this example, I'm using AD FS 2.0, where the URL takes the form of https://<fqdn>/adfs/ls/IdpInitiatedSignOn.aspx. If I place this URL in a browser's address field, I'm presented with a site selection page, as shown in the following image.
To construct the URL you need, concatenate the original IdP-initiated login URL with the query string ?loginToRp=urn:amazon:webservices. The resulting URL takes the form https://<fqdn>/adfs/ls/IdpInitiatedSignOn.aspx?loginToRp=urn:amazon:webservices. Save this URL; it is referred to as the idpentryurl variable below. If you place this concatenated URL in a browser's address field, you should still reach the AD FS IdP-initiated login process, but you bypass the site selection page and proceed straight to the AWS Management Console after you authenticate. This is the URL you need for this exercise. If you are interested, this query string is documented on Microsoft's website.
Important note: Be sure to pay close attention to the capitalization of the hostname in the URL. AD FS uses this as part of the authentication process, and any mismatch in capitalization will cause the authentication to fail.
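If you would rather assemble this URL in code, the following minimal Python sketch shows the concatenation; the fqdn value here is a hypothetical AD FS hostname and must be replaced with your own (preserving its exact capitalization, per the note above).

# Illustrative only: build the IdP-initiated login URL for the AWS relying party
fqdn = 'adfs.example.com'  # hypothetical AD FS hostname; replace with your own
idpentryurl = ('https://' + fqdn +
               '/adfs/ls/IdpInitiatedSignOn.aspx?loginToRp=urn:amazon:webservices')
print idpentryurl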
Now that you have gone through the steps in the "Getting Started" section above, you are ready to start assembling the integration code that enables federated API and CLI access. I'll explain what's going on one chunk at a time so that you can follow along, and I have included a link to the full code example at the end of this post. First, let's set up some basic imports and variables.
#!/usr/bin/python

import sys
import boto.sts
import boto.s3
import requests
import getpass
import ConfigParser
import base64
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup
from os.path import expanduser
from urlparse import urlparse, urlunparse
from requests_ntlm import HttpNtlmAuth

##########################################################################
# Variables

# region: The default AWS region that this script will connect
# to for all API calls
region = 'us-west-2'

# output format: The AWS CLI output format that will be configured in the
# saml profile (affects subsequent CLI calls)
outputformat = 'json'

# awsconfigfile: The file where this script will store the temp
# credentials under the saml profile
awsconfigfile = '/.aws/credentials'

# SSL certificate verification: Whether or not strict certificate
# verification is done, False should only be used for dev/test
sslverification = True

# idpentryurl: The initial URL that starts the authentication process.
idpentryurl = 'https://<fqdn>/adfs/ls/IdpInitiatedSignOn.aspx?loginToRp=urn:amazon:webservices'

##########################################################################
Adjust the variables in the previous code to suit your region (for example, us-west-2 or us-east-1) and your preferred output format (json, text, or table), and then plug in the idpentryurl value you noted in the previous section. If you're using a self-signed certificate in a trusted testing environment, you'll also need to set sslverification to False to avoid certificate verification failures.
With the basic environment set up, prompt the user and retrieve the user's credentials from standard input. I'll talk later in this post about how we can support other credential forms.
# Get the federated credentials from the user
print "Username:",
username = raw_input()
password = getpass.getpass()
print ''
The comma following the first print "Username:", statement might look like a bug, but it is actually a little Python trick to prevent the print statement from appending a newline character. You'll also note the use of the getpass() method, which prevents the password from being echoed to the terminal.
Next, we use the Python requests module to assemble the authentication information and formulate the HTTPS request to the IdP, as shown in the following snippet. Assuming authentication is successful, the response returned by AD FS will contain the SAML assertion.
# Initiate session handler
session = requests.Session()

# Programmatically get the SAML assertion
# Set up the NTLM authentication handler by using the provided credentials
session.auth = HttpNtlmAuth(username, password, session)

# Opens the initial AD FS URL and follows all of the HTTP302 redirects
response = session.get(idpentryurl, verify=sslverification)

# Debug the response if needed
#print (response.text)
At this point we have the IdP response safely in hand, so let's overwrite and dereference the variables we used to store the user's username and password because they are no longer needed. Python, as a higher-level language, doesn't provide the direct memory manipulation to do this like one might in C, but this does a good job of ensuring the credentials don't linger around.
# Overwrite and delete the credential variables, just for safety
username = '##############################################'
password = '##############################################'
del username
del password
As I noted previously, the SAML assertion that we need can be found in the response from the IdP. The following code uses the BeautifulSoup module to extract it.
# Decode the response and extract the SAML assertion
soup = BeautifulSoup(response.text.decode('utf8'))
assertion = ''

# Look for the SAMLResponse attribute of the input tag (determined by
# analyzing the debug print lines above)
for inputtag in soup.find_all('input'):
    if(inputtag.get('name') == 'SAMLResponse'):
        #print(inputtag.get('value'))
        assertion = inputtag.get('value')
This snippet simply iterates through all of the HTML input tags until it finds the one named SAMLResponse. The Base64-encoded assertion we're after is contained in the value attribute of this tag. Before moving on, we need to go one layer deeper into the SAML authentication response itself and extract the list of the user's authorized roles. This list is stored in the SAML attribute named https://aws.amazon.com/SAML/Attributes/Role, as shown in the following code.
# Parse the returned assertion and extract the authorized roles
awsroles = []
root = ET.fromstring(base64.b64decode(assertion))

for saml2attribute in root.iter('{urn:oasis:names:tc:SAML:2.0:assertion}Attribute'):
    if (saml2attribute.get('Name') == 'https://aws.amazon.com/SAML/Attributes/Role'):
        for saml2attributevalue in saml2attribute.iter('{urn:oasis:names:tc:SAML:2.0:assertion}AttributeValue'):
            awsroles.append(saml2attributevalue.text)

# Note the format of the attribute value should be role_arn,principal_arn
# but lots of blogs list it as principal_arn,role_arn so let's reverse
# them if needed
for awsrole in awsroles:
    chunks = awsrole.split(',')
    if 'saml-provider' in chunks[0]:
        newawsrole = chunks[1] + ',' + chunks[0]
        index = awsroles.index(awsrole)
        awsroles.insert(index, newawsrole)
        awsroles.remove(awsrole)
In the first loop above, I'm using the Python xml.etree.ElementTree module to parse the SAML assertion. Though assertions aren't particularly pretty to look at, the specific namespaces and names make finding what you're looking for relatively straightforward.
In the second loop, the code is just doing a bit of proactive error proofing to make sure the Amazon Resource Names (ARNs) contained within the roles are in the right order.
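To make the expected format concrete, a single AttributeValue in the assertion is a comma-separated role/provider pair. Using the account from the sample output above (the role and the ADFS provider name are illustrative), one value might look like this:

arn:aws:iam::012345678987:role/ADFS-Operators,arn:aws:iam::012345678987:saml-provider/ADFS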
Now we know the roles the user is authorized to assume. Next, the code asks the user which role they want to assume.
# If I have more than one role, ask the user which one they want,
# otherwise just proceed
print ""
if len(awsroles) > 1:
    i = 0
    print "Please choose the role you would like to assume:"
    for awsrole in awsroles:
        print '[', i, ']: ', awsrole.split(',')[0]
        i += 1

    print "Selection: ",
    selectedroleindex = raw_input()

    # Basic sanity check of input
    if int(selectedroleindex) > (len(awsroles) - 1):
        print 'You selected an invalid role index, please try again'
        sys.exit(0)

    role_arn = awsroles[int(selectedroleindex)].split(',')[0]
    principal_arn = awsroles[int(selectedroleindex)].split(',')[1]
else:
    role_arn = awsroles[0].split(',')[0]
    principal_arn = awsroles[0].split(',')[1]
If the user only has one role, it is automatically used. Otherwise, the user will be prompted to select which role they want to use from a simple enumerated list.
Things now start to get quite a bit easier. Because we've already retrieved the SAML assertion and extracted the various ARNs, we just need to call the standard AWS STS service to request AWS temporary security credentials. This is really the crux of the entire utility: we've undertaken all of the effort up to this point to prepare the parameters for the following simple AssumeRoleWithSAML API call.
# Use the assertion to get an AWS STS token using Assume Role with SAML
conn = boto.sts.connect_to_region(region)
token = conn.assume_role_with_saml(role_arn, principal_arn, assertion)
Important note: The SAML assertion, which comes from a configured and trusted IdP, serves as the authentication credential for this API call.
Now that we have our credentials from AWS STS, the next step is to insert these credentials into a specific profile within the AWS credentials file as shown in the following code snippet. By doing so, you'll be able to use the temporary security credentials for any number of API or CLI calls until the credentials expire.
# Write the AWS STS token into the AWS credential file
home = expanduser("~")
filename = home + awsconfigfile

# Read in the existing config file
config = ConfigParser.RawConfigParser()
config.read(filename)

# Put the credentials into a specific profile instead of clobbering
# the default credentials
if not config.has_section('saml'):
    config.add_section('saml')

config.set('saml', 'output', outputformat)
config.set('saml', 'region', region)
config.set('saml', 'aws_access_key_id', token.credentials.access_key)
config.set('saml', 'aws_secret_access_key', token.credentials.secret_key)
config.set('saml', 'aws_session_token', token.credentials.session_token)

# Write the updated config file
with open(filename, 'w+') as configfile:
    config.write(configfile)
In this code snippet, I'm simply using a combination of variables to locate the AWS credentials file, reading in the file, adding or updating a profile called 'saml', and writing the file back out.
Finally, the code gives the federated user a basic idea of what has just happened.
# Give the user some basic info as to what has just happened
print '\n\n----------------------------------------------------------------'
print 'Your new access key pair has been stored in the AWS configuration file {0} under the saml profile.'.format(filename)
print 'Note that it will expire at {0}.'.format(token.credentials.expiration)
print 'After this time you may safely rerun this script to refresh your access key pair.'
print 'To use this credential call the AWS CLI with the --profile option (e.g. aws --profile saml ec2 describe-instances).'
print '----------------------------------------------------------------\n\n'
If your automation is CLI based, you can skip ahead to the next section, "Putting it all together." Alternatively, if your automation is API based, you can simply use the temporary security credentials when making your connections.
# Use the AWS STS token to list all of the S3 buckets
s3conn = boto.s3.connect_to_region(region,
                     aws_access_key_id=token.credentials.access_key,
                     aws_secret_access_key=token.credentials.secret_key,
                     security_token=token.credentials.session_token)

buckets = s3conn.get_all_buckets()
print 'Simple API example listing all s3 buckets:'
print(buckets)
In practice, it would be far better form to place these additional API calls in a separate script that utilizes the credential profile we set up above, but the point of the exercise is that the credentials received from AWS STS are being used to authenticate the API call.
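For instance, a separate script could read the temporary credentials back out of the saml profile rather than holding the STS token in memory. The following is a minimal sketch along those lines; it reuses only the modules and calls already shown above, and assumes the federation script has already populated the saml profile.

#!/usr/bin/python
# Hypothetical follow-on script: reuse the temporary credentials stored
# under the saml profile by the federation script above.
import boto.s3
import ConfigParser
from os.path import expanduser

region = 'us-west-2'

# Read the saml profile from the shared credentials file
config = ConfigParser.RawConfigParser()
config.read(expanduser("~") + '/.aws/credentials')

# Connect to S3 with the stored temporary credentials
s3conn = boto.s3.connect_to_region(region,
                     aws_access_key_id=config.get('saml', 'aws_access_key_id'),
                     aws_secret_access_key=config.get('saml', 'aws_secret_access_key'),
                     security_token=config.get('saml', 'aws_session_token'))

print 'Buckets visible to the saml profile:'
print s3conn.get_all_buckets()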
When you put all of this code together, you see what's shown in the following output.
janedoe@Ubuntu64:/tmp$ ./samlapi.py
Username: AD\janedoe
Password: ****************

Please choose the role you would like to assume:
[ 0 ]: arn:aws:iam::012345678987:role/ADFS-Administrators
[ 1 ]: arn:aws:iam::012345678987:role/ADFS-Operators
Selection: 1

----------------------------------------------------------------
Your new access key pair has been stored in the aws configuration file /home/janedoe/.aws/credentials under the saml profile.
Note that it will expire at 2015-05-26T17:16:20Z.
After this time you may safely rerun this script to refresh your access key pair.
To use this credential call the aws cli with the --profile option (e.g. aws --profile saml ec2 describe-instances).
----------------------------------------------------------------

Simple API example listing all s3 buckets:
[<Bucket: mybucket1>, <Bucket: mybucket2>, <Bucket: mybucket3>, <Bucket: mybucket4>, <Bucket: mybucket5>]
What are we seeing in the preceding output, exactly? The same flow described at the start of this post: authentication against AD FS, role selection, storage of the temporary security credentials under the saml profile, and a simple S3 API call.
You can then issue subsequent API or CLI calls by simply referencing the saml profile, as shown in the following output.
janedoe@Ubuntu64:/tmp$ aws --profile saml --output text s3api list-buckets
BUCKETS 2014-09-09T17:11:48.000Z mybucket1
BUCKETS 2014-11-03T18:27:35.000Z mybucket2
BUCKETS 2014-08-13T02:45:28.000Z mybucket3
BUCKETS 2015-01-14T17:53:47.000Z mybucket4
BUCKETS 2015-05-19T19:23:25.000Z mybucket5
OWNER myowner 128abcedf111aa33cc44ddee5512345abcd6fff4eed
In the end, the browser-oriented nature of SAML requires a bit of code and a bit of engineering, but it accomplishes the goal: enabling federation-based authentication for API and CLI calls!
If you are using an IdP other than AD FS, don't despair that your solution was not included in this blog post. The reason I chose the integration technique of using a Python module to emulate browser behavior is that it should be extensible to just about any IdP solution.
Similarly, if you are using other authentication mechanisms such as Kerberos, X.509 certificates, or Multi-Factor Authentication, there are additional handlers within the Python requests module. These additional handlers can be used to enhance the example above after you have the basics working.
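As a rough illustration of what such a swap might look like, the sketch below replaces the NTLM handler with the HTTPKerberosAuth handler from the companion requests-kerberos package, and shows where a client certificate could be attached to the session. The package name, its installation, and your IdP's support for these methods are all assumptions here; treat this as a starting point rather than a drop-in change.

# Sketch only: alternative authentication handlers for the requests session.
# Assumes 'pip install requests-kerberos' and an IdP configured for these methods.
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

session = requests.Session()

# Option 1: Kerberos/SPNEGO authentication instead of NTLM
session.auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)

# Option 2: X.509 client certificate authentication (uncomment to use)
# session.cert = ('/path/to/client.crt', '/path/to/client.key')

# idpentryurl and sslverification come from the variables section above
response = session.get(idpentryurl, verify=sslverification)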
If you want to get started quickly, see the full code example. I hope you enjoyed this post, and I look forward to your feedback.
– Quint
This lab provides a basic understanding of Amazon Redshift. It demonstrates the basic steps required to get started with Redshift, including:
· Creating a Redshift cluster
· Loading data into a Redshift cluster
· Performing queries against the data in the cluster
Topics covered
By the end of this lab you will be able to:
· Launch a Redshift cluster
· Connect an SQL client to the Amazon Redshift cluster
· Load data from an S3 bucket into the Amazon Redshift cluster
· Run queries against data stored in Amazon Redshift
Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools.
Amazon Simple Storage Service (Amazon S3) makes it simple and practical to collect, store, and analyze data - regardless of format - all at massive scale. S3 is object storage built to store and retrieve any amount of data from anywhere - web sites and mobile apps, corporate applications, and data from IoT sensors.
Prerequisites
Familiarity with relational databases and SQL concepts would be beneficial.
Task 1: Launch an Amazon Redshift Cluster
In this task, you will launch an Amazon Redshift cluster. A cluster is a fully managed data warehouse that consists of a set of compute nodes. Each cluster runs an Amazon Redshift engine and contains one or more databases.
When you launch a cluster, one of the options you specify is the node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node. Node types are available in different sizes. Node size and the number of nodes determine the total storage for a cluster.
You can also type in the search box to select the AWS service (e.g., Redshift) that you wish to use.
· Cluster identifier:
· Database name:
· Master user name:
· Master user password:
The NODE CONFIGURATION page is displayed. This is where you can select the Node type and the number of nodes in your cluster. For this lab, you will be using the default configuration of a single-node cluster.
If you change the cluster to a multi-node configuration, at least two compute nodes are required.
The ADDITIONAL CONFIGURATION page will be displayed. This is where you can configure cluster settings to control encryption, security and networking.
This page is also where you can choose a different encryption key if you enable encryption for the cluster.
This role grants permission for Amazon Redshift to read data from Amazon S3.
The REVIEW page displays information about the cluster that you are about to launch.
You may ignore the warning about the free trial.
The cluster will take a few minutes to launch. Please continue with the labs steps. There is no need to wait.
The cluster configuration will be displayed. Spend a few minutes looking at the properties, but do not click any links.
· Cluster Properties: Contains information about the Cluster including: Name, Type, Node Type, number of Nodes, Zone location, Time and version of the creation as well as other information.
· Cluster Status: Allows you to see the current status of the cluster whether it is available or not and also whether it is currently In Maintenance Mode.
· Cluster Database Properties: Contains information on the Endpoint, which is the DNS address of the cluster, and the port number on which the database accepts connections. These are required when you want to create SQL connections. It also lets you know whether the cluster has a public IP address that can be accessed from the public internet. The JDBC URL and ODBC URL contain the URLs used to connect to the cluster from a JDBC (Java Database Connectivity) or ODBC (Open Database Connectivity) client.
Endpoint lab.cd9hri01rnv5.us-west-2.redshift.amazonaws.com:5439 ( authorized )
· Backup, Audit Logging and Maintenance: Contains information on how many days the automated snapshots are retained, whether they are automatically copied to another region, and whether logging is enabled on the cluster.
· Capacity Details: Contains information about the data warehouse node type, number of EC2 Compute Units per node, memory, disk storage, I/O performance as well as the processor architecture of the node type.
· SSH Ingestion Settings: Contains information about the Public Key of the cluster as well as the Public and Private IP addresses of the node.
Task 2: Launch Pgweb to Communicate with your Redshift Cluster
Amazon Redshift can be used via industry-standard SQL. To use Redshift, you require an SQL Client that provides a user interface to type SQL. Any SQL client that supports JDBC or ODBC can be used with Redshift.
For this lab, you will use a web application called Pgweb, which provides a friendly SQL interface to Redshift. The application is written for PostgreSQL, which is why it is called Pgweb. The SQL used in Amazon Redshift is based upon PostgreSQL.
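Because Amazon Redshift speaks the PostgreSQL wire protocol (which is how Pgweb connects), you could also connect to the cluster programmatically. The sketch below uses the psycopg2 Python driver, which is not part of this lab; the host comes from the lab's example endpoint, and the database name, user, and password are placeholders you would replace with your own cluster's values.

# Illustrative only: connecting to the lab cluster with the psycopg2 driver (not part of this lab).
import psycopg2

conn = psycopg2.connect(
    host='lab.cd9hri01rnv5.us-west-2.redshift.amazonaws.com',  # your cluster Endpoint, without the port
    port=5439,                                                 # the Redshift port used in this lab
    dbname='lab',                                              # placeholder: your Database name
    user='master',                                             # placeholder: your Master user name
    password='YOUR-PASSWORD')                                  # placeholder: your Master user password

cur = conn.cursor()
cur.execute('SELECT COUNT(*) FROM users;')  # the table created in Task 3
print(cur.fetchone()[0])
conn.close()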
The blue button will also copy values shown in this side panel.
You will be presented with a login page for pgweb:
Endpoint lab.cd9hri01rnv5.us-west-2.redshift.amazonaws.com:5439 ( authorized )
Note: the default port is 5432; change it to 5439.
If the endpoint is not yet displayed, it is because your cluster is still being created. Wait a minute and click the refresh icon until the Endpoint appears.
· Host: Paste the Endpoint that you copied, but remove the :5439 port number from the end
· Username:
· Password:
· Database:
· Port: 5439 (note that this is different from the default of 5432)
If you receive a too many colons error, double-check your Host. You will need to remove the :5439 port number from the end of the Host name.
Pgweb will connect to your Amazon Redshift database. If you are not able to connect to the database, double-check the above parameters.
Task 3: Create a Table
In this task, you will execute SQL commands to create a table in Redshift.
CREATE TABLE users
(userid INTEGER NOT NULL,
username CHAR(8),
firstname VARCHAR(30),
lastname VARCHAR(30),
city VARCHAR(30),
state CHAR(2),
email VARCHAR(100),
phone CHAR(14),
likesports BOOLEAN,
liketheatre BOOLEAN,
likeconcerts BOOLEAN,
likejazz BOOLEAN,
likeclassical BOOLEAN,
likeopera BOOLEAN,
likerock BOOLEAN,
likevegas BOOLEAN,
likebroadway BOOLEAN,
likemusicals BOOLEAN
);
This command will create a table called users. It contains each user's name, address, and details about the types of entertainment they like.
In the left navigation pane, below Tables, you should now see a users table.
You may ignore the No records found message because no records were queried.
The structure of the table is displayed:
Task 4: Load Sample Data from Amazon S3
Amazon Redshift can import data from Amazon S3. Various file formats are supported, including fixed-width fields, comma-separated values (CSV), and custom delimiters. The data for this lab is pipe-delimited (|).
COPY users FROM 's3://awssampledbuswest2/tickit/allusers_pipe.txt'
CREDENTIALS 'aws_iam_role=YOUR-ROLE'
DELIMITER '|';
Before running this command, you will need to insert the ROLE that Redshift will use to access Amazon S3.
COPY users FROM 's3://awssampledbuswest2/tickit/allusers_pipe.txt'
CREDENTIALS 'aws_iam_role=arn:aws:iam::557653742807:role/qls-1848056-f22f1d2579629cc6-RedshiftRole-1UEBS06ETOXBE'
The second line should now look like: CREDENTIALS 'aws_iam_role=arn:aws:iam...'
The command will take approximately 10 seconds to load 49,990 rows of data.
You should now see the data that was loaded into Redshift.
Task 5: Query Data
Now that you have data in your Redshift database you can query the data using SQL select statements and queries. If you are familiar with SQL, feel free to try additional commands to query the data.
SELECT COUNT(*) FROM users;
This query returns the number of rows in the users table.
SELECT userid, firstname, lastname, city, state
FROM users
WHERE likesports AND NOT likeopera AND state = 'OH'
ORDER BY firstname;
This query displays users in Ohio (OH) who like sports but do not like opera. The list is sorted by their first name.
SELECT
city,
COUNT(*) as count
FROM users
WHERE likejazz
GROUP BY city
ORDER BY count DESC
LIMIT 10;
This query shows the Top 10 cities where Jazz-loving users live.
Try to write a query that only displays the firstname and lastname of users who like both Theatre and Classical music, where their last name is Smith.
select firstname, lastname
from users
where liketheatre and likeclassical and lastname='Smith';
Additional resources

· Getting started with Amazon Redshift Spectrum: https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
· Create snapshot
· SQL Workbench/J client: http://www.sql-workbench.net/manual/install.html
· Configure a JDBC connection: https://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html
· Configure an ODBC connection: https://docs.aws.amazon.com/redshift/latest/mgmt/configure-odbc-connection.html
· Amazon Redshift Forum: https://forums.aws.amazon.com/forum.jspa?forumID=155