Tuesday, May 30, 2017

Skewed table in Hive

If you know a column is going to have heavy skew, you can specify this in the table's schema, for example:
CREATE TABLE Customers (
id int,
username string,
zip int
)
SKEWED BY (zip) ON (57701, 57702)
STORED AS DIRECTORIES;


When you specify the values with heavy skew, Hive automatically splits those out into separate files and
takes this into account during queries so that it can skip whole files where possible.

In the Customers table above, records with a zip of 57701 or 57702 will be stored in separate files
because the assumption is that there will be a large number of customers in those two ZIP codes.

Examine the table's properties in the detailed table information returned by the DESCRIBE FORMATTED Customers command.
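
To make the file-skipping idea concrete, here is a small Python model of how rows could be routed by skewed value. It is only an illustration and not Hive's implementation: the function names, the location labels, and the sample rows are all made up for this sketch.

```python
from collections import defaultdict

# Values declared in SKEWED BY (zip) ON (57701, 57702)
SKEWED_ZIPS = {57701, 57702}
# Hypothetical label for the location that holds all non-skewed rows
OTHER = "other"


def location_for(zip_code):
    """Return the storage location a row with this zip would land in."""
    return "zip=%d" % zip_code if zip_code in SKEWED_ZIPS else OTHER


def store(rows):
    """Group (id, username, zip) rows by storage location."""
    layout = defaultdict(list)
    for row in rows:
        layout[location_for(row[2])].append(row)
    return dict(layout)


def scan(layout, zip_code):
    """Read only the one location that can contain the requested zip."""
    candidates = layout.get(location_for(zip_code), [])
    return [row for row in candidates if row[2] == zip_code]


rows = [(1, "ann", 57701), (2, "bob", 57702), (3, "cy", 10001), (4, "di", 57701)]
layout = store(rows)
matches = scan(layout, 57701)
```

A lookup for 57701 touches only the rows stored under that value, which is the effect the skewed-table declaration is meant to achieve.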

External Table with ORC File Format and Snappy Compression

DROP TABLE IF EXISTS User_ORC;

CREATE EXTERNAL TABLE User_ORC(
    first_name VARCHAR(64),
    last_name VARCHAR(64),
    company_name VARCHAR(64),
    address STRUCT<zip:INT, street:STRING>,
    country VARCHAR(64),
    city VARCHAR(32),
    state VARCHAR(32),
    post INT,
    phone_nos ARRAY<STRING>,
    mail MAP<STRING, STRING>,
    web_address VARCHAR(64)
    )
    COMMENT 'Temporary ORC table for testing purpose'
    STORED AS ORC
    LOCATION '/user/hive/orc/user'
    TBLPROPERTIES ("orc.compress"="SNAPPY");


INSERT OVERWRITE TABLE User_ORC SELECT * FROM user;

The following command prints the formatted description of the User_ORC table.

DESCRIBE FORMATTED User_ORC;

Update Security Groups Automatically Using AWS Lambda

Lab Overview

Security is a top priority for Amazon Web Services (AWS), which provides many tools and services to meet your security needs. This lab presents one of many ways to enhance your security: it walks you through a method to automatically update your Virtual Private Cloud (VPC) security groups so that they only allow access from Amazon CloudFront and AWS Web Application Firewall (WAF). Defining security group rules this way prevents malicious requests from bypassing AWS WAF security rules and accessing your EC2 instances directly.

To allow only traffic that originates from the Amazon CloudFront and AWS WAF IP ranges, you need to be informed when AWS IP ranges change. AWS notifies users of service IP changes through a public Amazon Simple Notification Service (SNS) topic; each notification points to the service IP ranges in JSON format. Leveraging the integration between Amazon SNS and AWS Lambda, this lab demonstrates a way to automatically update security groups with the new IP ranges.

Topics Covered

After completing this lab, you should be able to:

  • Create VPC security groups
  • Create an IAM policy
  • Create a Lambda function
  • Test a Lambda function with sample events
  • Subscribe a Lambda function to an SNS topic

Technical knowledge prerequisites

This lab is intended for AWS learners. To successfully complete this lab, you should be familiar with AWS services including Amazon EC2, VPC security groups, Identity and Access Management (IAM) roles and policies, and Amazon Simple Notification Service (SNS). You should be comfortable logging into and using the AWS Management Console.

What is AWS Lambda?

AWS Lambda is a compute service that runs your code without requiring you to provision or manage servers. You upload your code to AWS Lambda and the service runs the code on your behalf using AWS infrastructure. AWS Lambda supports multiple programming languages, including Node.js, Java, and Python.

After you upload your code and create a Lambda function, AWS Lambda takes care of provisioning and managing the servers that run the code. In this lab, you will use AWS Lambda as an event-driven compute service, where AWS Lambda runs your code in response to Amazon SNS notifications about AWS IP range changes. The code for the Lambda function is provided with this lab.

What Is AWS CloudFormation?

AWS CloudFormation gives developers and system administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion.

You can use the AWS CloudFormation sample templates or create your own templates to describe the AWS resources, and any associated dependencies or runtime parameters, required to run your application. You don't need to figure out the order for provisioning AWS services or the subtleties of making those dependencies work. AWS CloudFormation takes care of this for you.

You can deploy and update a template and its associated collection of resources (called a stack) by using the AWS Management Console, AWS Command Line Interface, or APIs. AWS CloudFormation is available at no additional charge, and you pay only for the AWS resources needed to run your applications.

Create a security group

You're now going to create a security group in the AWS Management Console. The ingress rules of this security group will be updated automatically by a Lambda function, which you'll create later, so that they allow only the IP ranges belonging to Amazon CloudFront and AWS WAF.

6.       In the AWS Management Console, click Services, then click EC2.

7.       In the navigation pane, click Security Groups.

8.       Click Create Security Group.

9.       For Security group name, type AutoUpdateSecurityGroup

Note Copy the name to your clipboard as you will need it later.

10.    For Description, type 

11.    For VPC, choose Default VPC.

12.    Click Create.

13.    Now select the Security Group you created.

14.    Click Actions, and then click Add/Edit Tags.

15.    Click Create Tag.

Values are case sensitive. You will create a Lambda function that targets Security Groups with these tags to update security group rules.

16.    Create two tags. For the first tag, use the following values:

  • Key: Name
  • Value: cloudfront

Then create another tag:

  • Key: AutoUpdate
  • Value: true

17.    Click Save.
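
The tags matter because the Lambda function looks up security groups by tag. As a rough sketch (this is not the lab's code), the tag pairs used by the function later in this lab could be turned into EC2 describe_security_groups filters as shown here. The tag:<key> filter name is part of the EC2 API; the helper name build_tag_filters is invented for this example.

```python
def build_tag_filters(tags):
    """Build one 'tag:<key>' filter per tag so key and value must match together."""
    return [
        {"Name": "tag:" + key, "Values": [value]}
        for key, value in sorted(tags.items())
    ]


filters = build_tag_filters({"Name": "cloudfront", "AutoUpdate": "true"})
# The result could be passed as client.describe_security_groups(Filters=filters)
```

Because tag values are case sensitive, a group tagged with "True" instead of "true" would not match such a filter.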

Update IAM role for the Lambda function

When creating a Lambda function, it's important to understand and properly define the security context to which the Lambda function is subject.

An IAM role has already been created for you as part of the lab setup. In this section, you will create an IAM policy with the permissions needed for the Lambda function to execute and attach that to the existing IAM role.

Create an IAM policy

Note You can ignore any warnings you may see.

18.    In the AWS Management Console, on the Services menu click IAM.

19.    In the navigation pane, click Policies.

20.    Click Get Started, then click Create Policy.

21.    Select Create Your Own Policy.

22.    In Policy Name, type 

Note Copy the name to your clipboard for later use.

23.    Copy and paste the following policy document into the Policy Document box. As you paste the code, review it. Can you tell what the policy is doing?

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents"
                ],
                "Resource": "arn:aws:logs:*:*:*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeSecurityGroups",
                    "ec2:AuthorizeSecurityGroupIngress",
                    "ec2:RevokeSecurityGroupIngress"
                ],
                "Resource": "*"
            }
        ]
    }

24.    Click Create Policy.

The policy you created gives the Lambda function permission to read the EC2 security groups and make the necessary changes to their ingress rules. It also allows the Lambda function to write logs to the CloudWatch Logs service.
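
To check your reading of the policy, you can list what each statement allows. The sketch below does this for a copy of the policy document; summarize_policy is a hypothetical helper written for this post and not part of any AWS SDK.

```python
import json

POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeSecurityGroups",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:RevokeSecurityGroupIngress"
      ],
      "Resource": "*"
    }
  ]
}
""")


def summarize_policy(policy):
    """Map each service prefix (for example 'ec2') to the actions the policy allows."""
    summary = {}
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        for action in statement["Action"]:
            service, name = action.split(":", 1)
            summary.setdefault(service, []).append(name)
    return summary


summary = summarize_policy(POLICY)
for service in sorted(summary):
    print(service, "->", ", ".join(summary[service]))
```

The summary shows two groups of permissions: three CloudWatch Logs actions for logging and three EC2 actions for reading and changing security group ingress rules.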

Update IAM role

An IAM role was pre-created as part of the lab setup. The name of the role is lambda-role. In this section, you'll be attaching the IAM policy created in the previous section to lambda-role.

25.    In the AWS Management Console, click Services then click IAM.

26.    In the navigation pane, click Roles.

27.    Click the role named lambda-role.

28.    Click Attach Policy.

29.    On the Attach Policy screen, select the policy you created earlier. You can search for the policy by entering its name in the search filter.

30.    Click Attach Policy.

The role now has the permissions required by the Lambda function that will be set up in the next section.

Create the Lambda function

31.    On the Services menu, click Lambda.

32.    Click Create a Lambda function.

Note If you've never created Lambda functions before, you will click Get Started Now. If you have existing Lambda functions, click Functions on the navigation pane.

33.    You are prompted to select a blueprint. Blueprints can be a great starting point when you build your own Lambda function. However, in this lab the function code is provided for you, so click Blank Function.

34.    You can skip configuring your trigger. On the Configure Triggers page, simply click Next.

35.    For Name, type SecurityGroupAutoUpdate. This is the function name.

36.    For Runtime, select Python 2.7.

37.    For Code entry type, click Edit code inline. Remove the placeholder code and paste the following code:

 

'''
Copyright 2015 Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/

or in the "license" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
'''

import boto3
import hashlib
import json
import urllib2

# Name of the service, as seen in the ip-groups.json file, to extract information for
SERVICE = "CLOUDFRONT"
# Ports your application uses that need inbound permissions from the service for
INGRESS_PORTS = [ 80 ]
# Tags which identify the security groups you want to update
SECURITY_GROUP_TAGS = { 'Name': 'cloudfront', 'AutoUpdate': 'true' }

def lambda_handler(event, context):
    print("Received event: " + json.dumps(event, indent=2))
    message = json.loads(event['Records'][0]['Sns']['Message'])

    # Load the ip ranges from the url
    ip_ranges = json.loads(get_ip_groups_json(message['url'], message['md5']))

    # extract the service ranges
    cf_ranges = get_ranges_for_service(ip_ranges, SERVICE)

    # update the security groups
    result = update_security_groups(cf_ranges)

    return result

def get_ip_groups_json(url, expected_hash):
    print("Updating from " + url)

    response = urllib2.urlopen(url)
    ip_json = response.read()

    m = hashlib.md5()
    m.update(ip_json)
    hash = m.hexdigest()

    if hash != expected_hash:
        raise Exception('MD5 Mismatch: got ' + hash + ' expected ' + expected_hash)

    return ip_json

def get_ranges_for_service(ranges, service):
    service_ranges = list()
    for prefix in ranges['prefixes']:
        if prefix['service'] == service:
            print('Found ' + service + ' range: ' + prefix['ip_prefix'])
            service_ranges.append(prefix['ip_prefix'])

    return service_ranges

def update_security_groups(new_ranges):
    client = boto3.client('ec2')

    groups = get_security_groups_for_update(client)
    print ('Found ' + str(len(groups)) + ' SecurityGroups to update')

    result = list()
    updated = 0

    for group in groups:
        if update_security_group(client, group, new_ranges):
            updated += 1
            result.append('Updated ' + group['GroupId'])

    result.append('Updated ' + str(updated) + ' of ' + str(len(groups)) + ' SecurityGroups')

    return result

def update_security_group(client, group, new_ranges):
    added = 0
    removed = 0

    if len(group['IpPermissions']) > 0:
        for permission in group['IpPermissions']:
            if INGRESS_PORTS.count(permission['ToPort']) > 0:
                old_prefixes = list()
                to_revoke = list()
                to_add = list()
                for range in permission['IpRanges']:
                    cidr = range['CidrIp']
                    old_prefixes.append(cidr)
                    if new_ranges.count(cidr) == 0:
                        to_revoke.append(range)
                        print(group['GroupId'] + ": Revoking " + cidr + ":" + str(permission['ToPort']))

                for range in new_ranges:
                    if old_prefixes.count(range) == 0:
                        to_add.append({ 'CidrIp': range })
                        print(group['GroupId'] + ": Adding " + range + ":" + str(permission['ToPort']))

                removed += revoke_permissions(client, group, permission, to_revoke)
                added += add_permissions(client, group, permission, to_add)
    else:
        for port in INGRESS_PORTS:
            to_add = list()
            for range in new_ranges:
                to_add.append({ 'CidrIp': range })
                print(group['GroupId'] + ": Adding " + range + ":" + str(port))
            permission = { 'ToPort': port, 'FromPort': port, 'IpProtocol': 'tcp'}
            added += add_permissions(client, group, permission, to_add)

    print (group['GroupId'] + ": Added " + str(added) + ", Revoked " + str(removed))
    return (added > 0 or removed > 0)

def revoke_permissions(client, group, permission, to_revoke):
    if len(to_revoke) > 0:
        revoke_params = {
            'ToPort': permission['ToPort'],
            'FromPort': permission['FromPort'],
            'IpRanges': to_revoke,
            'IpProtocol': permission['IpProtocol']
        }

        client.revoke_security_group_ingress(GroupId=group['GroupId'], IpPermissions=[revoke_params])

    return len(to_revoke)

def add_permissions(client, group, permission, to_add):
    if len(to_add) > 0:
        add_params = {
            'ToPort': permission['ToPort'],
            'FromPort': permission['FromPort'],
            'IpRanges': to_add,
            'IpProtocol': permission['IpProtocol']
        }

        client.authorize_security_group_ingress(GroupId=group['GroupId'], IpPermissions=[add_params])

    return len(to_add)

def get_security_groups_for_update(client):
    filters = list()
    for key, value in SECURITY_GROUP_TAGS.iteritems():
        filters.extend(
            [
                { 'Name': "tag-key", 'Values': [ key ] },
                { 'Name': "tag-value", 'Values': [ value ] }
            ]
        )

    response = client.describe_security_groups(Filters=filters)

    return response['SecurityGroups']

'''
 Sample Event From SNS:

{
  "Records": [
    {
      "EventVersion": "1.0",
      "EventSubscriptionArn": "arn:aws:sns:EXAMPLE",
      "EventSource": "aws:sns",
      "Sns": {
        "SignatureVersion": "1",
        "Timestamp": "1970-01-01T00:00:00.000Z",
        "Signature": "EXAMPLE",
        "SigningCertUrl": "EXAMPLE",
        "MessageId": "95df01b4-ee98-5cb9-9903-4c221d41eb5e",
        "Message": "{\"create-time\": \"yyyy-mm-ddThh:mm:ss+00:00\", \"synctoken\": \"0123456789\", \"md5\": \"03a8199d0c03ddfec0e542f8bf650ee7\", \"url\": \"https://ip-ranges.amazonaws.com/ip-ranges.json\"}",
        "Type": "Notification",
        "UnsubscribeUrl": "EXAMPLE",
        "TopicArn": "arn:aws:sns:EXAMPLE",
        "Subject": "TestInvoke"
      }
    }
  ]
}
'''


Note The above Python code uses the AWS SDK for Python (Boto3) to do the following:

  • Get the latest IP ranges from the URL given in the SNS notification
  • Filter the IP ranges for those belonging to Amazon CloudFront and AWS WAF
  • Update the VPC security groups that have been tagged with "Name": "cloudfront" and "AutoUpdate": "true"
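
The core of the update step is a comparison between the CIDR ranges already on the security group and the ranges published for the service. A minimal Python 3 sketch of that comparison, using sets instead of the list counting in the lab code, might look like this. The function name diff_ranges and the sample CIDR blocks (taken from documentation address space) are invented for illustration.

```python
def ranges_for_service(ip_ranges, service):
    """Collect the IPv4 prefixes published for one service."""
    return [p["ip_prefix"] for p in ip_ranges["prefixes"] if p["service"] == service]


def diff_ranges(current, new):
    """Return (to_add, to_revoke) as sorted lists of CIDR strings."""
    current_set, new_set = set(current), set(new)
    return sorted(new_set - current_set), sorted(current_set - new_set)


# Made-up sample in the shape of ip-ranges.json
sample = {
    "prefixes": [
        {"ip_prefix": "192.0.2.0/24", "region": "GLOBAL", "service": "CLOUDFRONT"},
        {"ip_prefix": "198.51.100.0/24", "region": "GLOBAL", "service": "CLOUDFRONT"},
        {"ip_prefix": "203.0.113.0/24", "region": "us-east-1", "service": "EC2"},
    ]
}

new = ranges_for_service(sample, "CLOUDFRONT")
to_add, to_revoke = diff_ranges(["192.0.2.0/24", "10.0.0.0/8"], new)
```

Ranges present on both sides are left alone, so running the update twice with the same input makes no further changes.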

38.    Under Lambda function handler and role, for:

  • Handler, select lambda_function.lambda_handler.
  • Role, select Choose an existing role
  • Existing Role, select lambda-role

39.    Under Advanced settings, increase the Timeout to 5 seconds.

Note If you are updating several security groups with this function, you might have to increase the timeout.

40.    Click Next.

41.    After confirming your settings are correct, click Create function.

Test Your Lambda Function

You'll test your Lambda function and initialize the security group created earlier.

42.    In the Lambda console, select your function, click Actions, and then click Configure test event.

43.    Review the event below. This represents an SNS notification. Remove the existing test event code and enter the following as your sample event.

    {
        "Records": [
            {
                "EventVersion": "1.0",
                "EventSubscriptionArn": "arn:aws:sns:EXAMPLE",
                "EventSource": "aws:sns",
                "Sns": {
                    "SignatureVersion": "1",
                    "Timestamp": "1970-01-01T00:00:00.000Z",
                    "Signature": "EXAMPLE",
                    "SigningCertUrl": "EXAMPLE",
                    "MessageId": "95df01b4-ee98-5cb9-9903-4c221d41eb5e",
                    "Message": "{\"create-time\": \"yyyy-mm-ddThh:mm:ss+00:00\", \"synctoken\": \"0123456789\", \"md5\": \"7fd59f5c7f5cf643036cbd4443ad3e4b\", \"url\": \"https://ip-ranges.amazonaws.com/ip-ranges.json\"}",
                    "Type": "Notification",
                    "UnsubscribeUrl": "EXAMPLE",
                    "TopicArn": "arn:aws:sns:EXAMPLE",
                    "Subject": "TestInvoke"
                }
            }
        ]
    }

44.    After you've added the sample event, click Save and test. Your Lambda function will be invoked, and the output will report an error in execution. You should see the log output at the bottom of the console similar to the following.

    Updating from https://ip-ranges.amazonaws.com/ip-ranges.json
    MD5 Mismatch: got **some hash value** expected 7fd59f5c7f5cf643036cbd4443ad3e4b: Exception
    Traceback (most recent call last):
      File "/var/task/lambda_function.py", line 29, in lambda_handler
        ip_ranges = json.loads(get_ip_groups_json(message['url'], message['md5']))
      File "/var/task/lambda_function.py", line 50, in get_ip_groups_json
        raise Exception('MD5 Mismatch: got ' + hash + ' expected ' + expected_hash)
    Exception: MD5 Mismatch: got **some hash value** expected **some hash value**

You will see a message indicating there was a hash mismatch. Normally, a real SNS notification from the IP Ranges SNS topic will include the right hash, but because the sample event is only a test case representing the event, you will need to update the sample event manually with the expected hash. Copy the hash value that follows the word "got" in the message "...got **some hash value** expected 7fd59f5c7f5cf643036cbd4443ad3e4b".

For example, if the output contains the line

"errorMessage": "MD5 Mismatch: got 001fd33aa4135060111a137ae58cb057 expected 7fd59f5c7f5cf643036cbd4443ad3e4b"

then the value to use is 001fd33aa4135060111a137ae58cb057.

45.    In the Lambda console, select your function, click Actions, and then click Configure test event. Replace the md5 value in the Message field of the sample event with the hash value you copied in the previous step.

It should look something like this (note that this is just an example and your md5 value may be different):

    \"md5\": \"88386cb87e7814b75bc518eb841e92bb\",

46.    Click Save and test.

Your Lambda function will be invoked. This time, you should see a successful output indicating that your security group was properly updated with the IP ranges belonging to Amazon CloudFront and AWS WAF.
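
If you prefer not to copy the hash out of the error message, you could compute it yourself. The sketch below is Python 3 and assumes you have already downloaded the ip-ranges.json bytes by some other means; build_test_message is a hypothetical helper, and the payload shown is a stand-in rather than the real file.

```python
import hashlib
import json


def md5_hex(payload):
    """Return the MD5 hex digest of a bytes payload, as the lab function computes it."""
    return hashlib.md5(payload).hexdigest()


def build_test_message(payload, url):
    """Build the JSON string that goes in the Sns.Message field of the test event."""
    return json.dumps({
        "create-time": "yyyy-mm-ddThh:mm:ss+00:00",
        "synctoken": "0123456789",
        "md5": md5_hex(payload),
        "url": url,
    })


payload = b'{"prefixes": []}'  # stand-in for the downloaded file contents
message = build_test_message(payload, "https://ip-ranges.amazonaws.com/ip-ranges.json")
```

The hash only matches while the published file is unchanged, so a test event built this way can go stale once AWS updates the ranges.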

Verify Security Group update

When you tested the Lambda function, it should have updated the previously created security group with the latest IP ranges belonging to Amazon CloudFront and AWS WAF. To view and verify the update:

47.    In the AWS Management Console, on the Services menu, click EC2.

48.    In the navigation pane, find the NETWORK & SECURITY heading, then click Security Groups.

49.    Select the security group created earlier, AutoUpdateSecurityGroup.

50.    In the bottom pane, select the Inbound tab.

You will now see all the CloudFront IP ranges added as allowed points of ingress.

Configure Lambda function's trigger

Subscribe the Lambda function to the SNS topic so that any changes in the IP ranges are automatically applied to the security group's ingress rules.

51.    In the AWS Management Console, choose US East (N. Virginia) as the region in the top right corner.

Important Ensure that the region is set to US East (N. Virginia) before proceeding with the next step.

52.    On the Services menu, click SNS.

53.    Click Get Started if that option is available on the page.

54.    In the navigation pane, click Subscriptions.

55.    Click Create subscription.

56.    For Topic ARN, enter 

57.    For Protocol, click AWS Lambda.

58.    For Endpoint, choose the Lambda function you created earlier. You should see a function named SecurityGroupAutoUpdate.

59.    For Version or alias, select default.

60.    Click Create subscription.


This subscription links the Lambda function that you created with the SNS topic, so that any change in the IP ranges communicated over the SNS topic automatically invokes the function.
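
The same subscription could in principle be created from code rather than the console. The following is only a sketch that has not been run against AWS: it assumes boto3-style SNS and Lambda clients are passed in, and the caller supplies the topic and function ARNs. The subscribe and add_permission calls are existing boto3 client methods; subscribe_function and the statement ID are names made up here.

```python
def subscribe_function(sns_client, lambda_client, topic_arn, function_arn):
    """Allow SNS to invoke the function, then subscribe it to the topic."""
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId="AllowSnsInvoke",  # arbitrary statement id chosen for this sketch
        Action="lambda:InvokeFunction",
        Principal="sns.amazonaws.com",
        SourceArn=topic_arn,
    )
    response = sns_client.subscribe(
        TopicArn=topic_arn, Protocol="lambda", Endpoint=function_arn
    )
    return response["SubscriptionArn"]


# With real credentials the clients would come from boto3, for example:
#   sns_client = boto3.client("sns", region_name="us-east-1")
#   lambda_client = boto3.client("lambda")
```

The SNS client is created in US East (N. Virginia) for the same reason the console steps switch to that region, while the Lambda client uses the region where the function lives.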

Verify SNS subscription as a Lambda trigger

61.    Change the region back to where you created the Lambda function.

62.    On the Services menu, click Lambda.

63.    Select your Lambda function.

64.    Select the Triggers tab.

65.    Verify that AmazonIPSpaceChanged is now a trigger for the Lambda function.


 

Conclusion

Congratulations! You have successfully created a Lambda function that is triggered when AWS publishes service IP address updates.

End Your Lab

Follow these steps to close the console, end your lab, and evaluate the experience.

66.    In the upper right of the navigation bar of the AWS Management Console, click yourqwiklabsacct@<AccountNumber>, and then click Sign Out.

67.    Close any active SSH client sessions or remote desktop sessions.

68.    On the Qwiklabs page, click End Lab.

69.    In the confirmation message, click OK.

Additional Resources