Big Data Stampede 2014 - Hands-on Lab | Asia Pacific
Power of Analytics in your hands with IBM Big Data
Ref: IBD-2041
Table of Contents

Getting started with InfoSphere BigInsights
Exercise 1: BigInsights Web Console
    Launching the Web Console
    Working with the Welcome page
    Administering BigInsights
    Working with Files
Exercise 2: Social Media and BigSheets
    Collecting Social Media
    Creating a BigSheets Workbook
Exercise 3: Querying data with IBM Big SQL
    Connecting to the IBM Big SQL server
    Creating a project and an SQL script file
    Creating tables and loading data
    Running basic Big SQL queries
    Analyzing the data with Big SQL
Exercise 4: Developing, publishing, and deploying your first big data app
    Developing your application
    Publishing your application
    Deploy and run your application
Exercise 5: Analyzing BigSheets data in IBM Big SQL
    Create and modify a BigSheets workbook
    Export your Workbook
    Creating an Eclipse project and new SQL script
    Exporting a JSON Array to Big SQL
Exercise 6: Text analysis of social media data
    Upload and deploy the sample text analytics application
    Upload sample data and create a master workbook
    Analyzing text content of social media posts
Summary
Getting started with InfoSphere BigInsights

In this hands-on lab, you'll learn how to work with IBM InfoSphere BigInsights to analyze big data. In particular, you'll explore social media data using a spreadsheet-style interface to discover insights about the global coverage of a popular brand without writing any code. You'll also use Big SQL to query data managed by BigInsights and learn how to develop, test, publish, and deploy your first big data application. Finally, you'll see how business analysts can invoke a pre-packaged, custom text analytics application that analyzes the sentiments expressed about a popular brand in news and blog postings on the Web.

After completing this hands-on lab, you'll be able to:
• Navigate the BigInsights Web Console
• Explore big data using a spreadsheet-style tool
• Query big data using Big SQL
• Create and run an application
• Analyze text in social media data using a custom-built extractor

Allow 2.5 to 3 hours to complete the entire lab end-to-end. This version of the lab was designed using the InfoSphere BigInsights 2.1 Quick Start Edition.

Throughout this lab, you will be using the following account login information:

           Username    Password
    Linux  biadmin     biadmin
Exercise 1: BigInsights Web Console

IBM's InfoSphere BigInsights 2.1 Enterprise Edition enables firms to store, process, and analyze large volumes of various types of data. In this exercise, you'll see how you can work with its Web console to administer your system, launch applications (jobs), monitor the status of applications, and perform other functions.

For further details on the Web console or BigInsights, visit the product's InfoCenter. In addition, IBM's big data channel on YouTube contains a video demonstration of the Web console.

After completing this hands-on lab, you'll be able to:
• Launch the Web console.
• Work with popular resources accessible through the Welcome page.
• Administer BigInsights by inspecting the status of your cluster, starting and stopping components, and accessing administrative tools available for open source components provided with BigInsights.
• Work with the distributed file system. In particular, you'll explore the HDFS directory structure, create subdirectories, and upload files to HDFS.
• Launch applications (jobs) and inspect their status.

Allow 30 minutes to complete this section of the lab. This lab is an introduction to a subset of console functions. Real-time monitoring dashboards and application linking are among the more advanced console functions that are out of this lab's scope.
Launching the Web Console

The first step to accessing the BigInsights Web Console is to launch all of the BigInsights processes (Hadoop, Hive, Oozie, MapReduce, etc.). Once started, these services will remain active for the remainder of this lab.

__1. Select the Start BigInsights icon to start all services.

__2. Select the BigInsights Console icon to launch the Web Console.

__3. Verify that your Web console appears similar to this, and note each section:
• Tasks: Quick access to popular BigInsights tasks
• Quick Links: Internal and external links and downloads to enhance your environment
• Learn More: Online resources available to learn more about BigInsights
Working with the Welcome page

This section introduces you to the Web console's main page, displayed through the Welcome tab. The Welcome page features links to common tasks, many of which can also be launched from other areas of the console. In addition, the Welcome page includes links to popular external resources, such as the BigInsights InfoCenter (product documentation) and community forum. You'll explore several aspects of this page.

__1. In the Welcome tab, the Tasks pane allows you to quickly access common tasks. Select the View, start or stop a service task. If necessary, scroll down.

__2. This takes you to the Cluster Status tab. Here, you can stop and start Hadoop services, as well as gain additional information, as shown in the next section.

__3. Click on the Welcome tab to return to the main page.
__4. Inspect the Quick Links pane at top right and use its vertical scroll bar (if necessary) to become familiar with the various resources accessible through this pane. The first several links simply activate different tabs in the Web console, while subsequent links enable you to perform set-up functions, such as adding BigInsights plug-ins to your Eclipse development environment.

__5. Inspect the Learn More pane at lower right. Links in this area access external Web resources that you may find useful, such as the Accelerator demos and documentation, the BigInsights InfoCenter, a public discussion forum, IBM support, and IBM's BigInsights product site. If desired, click on one or more of these links to see what's available to you.
Administering BigInsights

The Web console allows administrators to inspect the overall health of the system as well as perform basic functions, such as starting and stopping specific servers or components, adding nodes to the cluster, and so on. You'll explore a subset of these capabilities here.

Inspecting the status of your cluster

__1. Click on the Cluster Status tab at the top of the page to return to the Cluster Status window.

__2. Inspect the overall status of your cluster. The figure below was taken on a single-node cluster that had several services running. One service, Monitoring, was unavailable. (If you installed and started all BigInsights services on your cluster, your display will show all services as running.)

__3. Click on the Hive service and note the detailed information provided for this service in the pane at right. From here, you can start or stop the Hive service (or any service you select) depending on your needs. For example, you can see the URL for Hive's Web interface and its process ID.
__4. Optionally, cut and paste the URL for Hive's Web interface into a new tab of your browser. You'll see an open source tool provided with Hive for administration purposes, as shown below.

__5. Close this tab and return to the Cluster Status section of the BigInsights Web console.

Starting and stopping a component

__6. If necessary, click on the Hive service to display its status.

__7. In the pane to the right (which displays the Hive status), click the red Stop button to stop the service.

__8. When prompted to confirm that you want to stop the Hive service, click OK and wait for the operation to complete. The right pane should appear similar to the following image.
__9. Restart the Hive service by clicking on the green arrow just beneath the Hive Status heading, as seen in the previous figure. When the operation completes (it may take a minute), the Web console will indicate that Hive is running again, likely under a process ID that differs from the earlier Hive process ID shown at the beginning of this lab module. You may need to use the Refresh button of your Web browser to reload the information displayed in the left pane.

Accessing service-specific Web tools

__10. Click on the Welcome tab.

__11. Click on the Access secure cluster servers button in the Quick Links section at right.

__12. Inspect the list of server components for which there are additional Web-based tools. The BigInsights console displays the URLs you can use to access each of these Web sites directly. (Make sure your pop-up blocker is disabled for this step.)

__13. Click on the namenode alias. Your browser will display the standard NameNode Web interface provided by Apache Hadoop, which is covered in a separate lab on HDFS and MapReduce.
Working with Files

The Files tab of the console enables you to explore the contents of your file system, create new subdirectories, upload small files for test purposes, and perform other file-related functions. In this module, you'll learn how to perform such tasks against the Hadoop Distributed File System (HDFS) of BigInsights.

__1. Click on the Files tab of the Web console to begin exploring your distributed file system.

__2. Expand the directory tree shown in the pane at left (/user/biadmin). If you already uploaded files to HDFS, you'll be able to navigate through the directory to locate them.
__3. Become familiar with the functions provided through the icons at the top of this pane, as we'll refer to some of these in subsequent sections of this module. Simply point your cursor at an icon to learn its function. From left to right, the icons enable you to copy a file or directory, move a file, create a directory, rename a file or directory, upload a file to HDFS, download a file from HDFS to your local file system, remove a file from HDFS, set permissions, open a command window to launch HDFS shell commands, and refresh the Web console page.

__4. Position your cursor on the /user/biadmin directory and click the Create Directory icon to create a subdirectory for test purposes.

__5. When a pop-up window appears prompting you for a directory name, enter ConsoleLab and click OK.

__6. Expand the directory hierarchy to verify that your new subdirectory was created.
__7. Create another directory named ConsoleLabTest.

__8. Use the Rename icon to rename this directory to ConsoleLabTest2.

__9. Click the Move icon. When the pop-up Move screen appears, select the ConsoleLab directory and click OK.
__10. Using the Set Permissions icon, you can change the permission settings for your directory. When finished, click OK.

__11. While highlighting the ConsoleLabTest2 folder, select the Remove icon and remove the directory.
__12. Remain in the ConsoleLab directory, and click the Upload icon to upload a small sample file for test purposes.

__13. When the pop-up window appears, click the Browse button to browse your local file system for a sample file.

__14. Navigate through your local file system to the directory where BigInsights was installed. For the IBM-provided VMware image, BigInsights is installed in file system: /opt/ibm/biginsights. Locate the …/IHC subdirectory and select the CHANGES.txt file. If you can't find it, you may select a different file for this step. Click OK.

__15. Verify that the window displays the name of this file. Note that you can continue to Browse for additional files to upload and that you can delete files as upload targets from the displayed list. However, for this exercise, simply click OK.

__16. When the upload completes, verify that the CHANGES.txt file appears in the directory tree at left; if it is not immediately visible, click the refresh button. On the right, you should see a subset of the file's contents displayed in text format.
__17. Highlight the CHANGES.txt file in your ConsoleLab directory and click the Download button.

__18. When prompted, click the Save File button. Then select OK.

__19. If Firefox is set as the default browser, the file will be saved to your user Downloads directory. For this exercise, the default directory location is fine.

Exercise 2: Social Media and BigSheets

IBM's InfoSphere BigInsights 2.1 Enterprise Edition enables firms to store, process, and analyze large volumes of various types of data. In this exercise, you'll see how you can find and ingest social media data from Boardreader and do some ad hoc data exploration with BigSheets.

This lab is based on an article that can be found here:
http://www.ibm.com/developerworks/data/library/techarticle/dm-1206socialmedia/index.html

If you have not done so, it is suggested that you read the article before attempting this lab, as it contains additional details that will help you understand the business reasons for the actions we take in the following lab steps. You can also watch a 14-minute video from the author of the article here:

http://www.youtube.com/watch?v=kny3nPwSZ_w

After completing this hands-on lab, you'll be able to:
• Create a BigSheets workbook in order to process some of the Boardreader input data that has been loaded into HDFS for you.
• Create BigSheets visualizations to see some of the quantitative analytics.

Allow 45 minutes to complete this lab.

Where did this data come from? For time efficiency of this lab, the data was already collected using the BoardReader application. BoardReader was used to collect data from news and blog websites, with the output saved into your HDFS.

Collecting Social Media

__1. If your BigInsights services aren't started, select the Start BigInsights icon.
__2. Select the BigInsights Console icon to launch the Web Console.

__3. Verify that your Web console appears similar to this:

The social media output that will be used for this lab has been preloaded into HDFS. This data was retrieved using the Boardreader app. To learn how to use the Boardreader app to pull social media data, refer to the BigSheets developerWorks article referenced above.

__4. To review this data, use the Files tab to navigate to the folder /user/biadmin/sampleData/IBMWatson and select the blogs-data.txt file as shown below.
__5. In a later section, we will open this text file within BigSheets in order to interact with and visualize the data within BigInsights.
Creating a BigSheets Workbook

In this section, we will use the spreadsheet-style interface known as BigSheets. BigSheets provides access to data in items known as "workbooks."

Creating a Workbook from Boardreader Data

__1. Return to the Files tab.

__2. Navigate to the /user/biadmin/sampleData/IBMWatson/blogs-data.txt file. Make sure this file is selected.

__3. Click the Sheet radio button to view this data within a BigSheets interface.

__4. The data that comes from the Boardreader application within BigInsights is formatted in a JSON Array structure. You will need to click on the pencil icon and select the JSON Array option for this file. Then click the green check.
__5. Save this as a Master Workbook named "Watson Blogs". You can optionally provide a description. Click the Save button.

__6. You will want to do the same thing for the news-data.txt file in the same folder. To do this, return to the Files tab, navigate to the file, and follow the same process. This time, provide the name "Watson News".

__7. Click on the "Workbooks" link in the upper left-hand corner of the page.

__8. You should now see two or more workbooks on your system, depending on the configuration of your VM. The top two are the ones you created in the previous steps.

__9. Adding tags allows us to quickly search and manage workbooks. Select the "Watson Blogs" workbook.

__10. Scroll down to Workbook Details and add the tags "Watson", "IBM", and "Blogs" by selecting the green + and adding each individually. If you don't see Workbook Details, you may need to toggle between Normal and Full Screen.
__11. From the BigSheets tab, you can quickly filter workbooks and search for a specific tag. Enter the term "tag: Blogs" to see all workbooks that have the associated tag.
Tailoring Your Workbooks

In this section, we will first show you how to reduce the amount of data you are working with by removing unwanted columns from within your BigSheets workbooks. We will also perform a merge, or "union", of two data workbooks.

__12. From the list of workbooks (you launched in the previous steps), click on the link named "Watson News" to open this workbook in BigSheets.

__13. This Master Workbook is a "base" workbook and has a limited set of things you can edit. Therefore, in order to begin to manipulate the data contained within a workbook, we will want to create a dependent workbook.

__a. Click the "Build new Workbook" button.

__b. When the new workbook appears, you can change its default name (by clicking on the pencil icon next to the name) to the new name of "Watson News Revised", then click the green check mark.

__c. Click the Fit column(s) button to more easily see columns A through H on your screen.
__14. You have decided you would like to remove the column "IsAdult" from your workbook. This is currently column E. Click on the triangle next to the column name "IsAdult" and select the "Remove" option to remove this column from your new workbook.

Did I lose data? Deleting a column does not remove data. Deleting a column in a workbook just removes the mapping to this column.

__15. In this case, you want to keep only a few columns. In order to more easily remove a larger number of columns (without having to repeat this click-and-remove process), click the triangle again (from any column) and select the "Organize Columns…" option.

__a. Click the red X button next to each column title you want to remove. In this case, KEEP the following columns:
__i. Country
__ii. FeedInfo
__iii. Language
__iv. Published
__v. SubjectHtml
__vi. Tags
__vii. Type
__viii. Url
__b. Click the green check mark button when you are ready to remove the columns you have selected for removal.

__16. Click on the Fit column(s) button again to show columns A through H. You should see the following columns in your new workbook.

__17. Select "Save and Exit". You may input an optional description. Click Save to complete the save.

__18. After clicking Save, you will be shown two buttons (Run and Close). Click the Run button to run the workbook. You can monitor the progress of your request by watching the status bar indicator in the upper right-hand side of the page.

__19. To reduce the unwanted columns in the "Watson Blogs" workbook, perform the same steps above in order to wind up with a new workbook called "Watson Blogs Revised".
__20. Now, since we have two workbooks with exactly the same structure, we can perform a "union" of these two workbooks as the basis for exploring the coverage of IBM Watson across the sources that Boardreader provided.

__21. To perform this action, make sure you are currently in the "Watson News Revised" workbook. Click the "Build New Workbook" button again.

__22. In the top left-hand side or bottom left, you should see a link called "Add sheets". This allows you to perform additional analysis on your data within the current workbook. Click the "Add sheets" link.

__23. The Load option will allow you to load data into the current workbook from another workbook. Click the Load icon and select the "Watson Blogs Revised" workbook link.

__24. The system will ask you for a *Sheet Name; change Sheet1 to "Watson Blogs Revised" as the name of the new tab that will be created in your current workbook.
__25. Click the green check-mark button at this time to load the new workbook into your current workbook.

__26. You should now see two tabs at the bottom of your current workbook. If you move your mouse over the second one, a tooltip will show the action and the name you provided for this sheet/tab within your current workbook. (Giving your tabs meaningful names gives you and others who use your sheets an easy way to understand your data processing flows.)

__27. Next, we need to add a new sheet to perform the Union function. Select Union.

__28. The Union function asks for the other "sheet" you would like to use. Select the triangle to expose the pull-down menu.

__29. Select "Watson News Revised" and then click the green plus-mark button.

__30. Then provide the sheet name "News and Blogs". Before you click the green check-mark button to add this new tab/sheet to your workbook, make sure your options match the example below.
__31. Click the green check-mark button to add this tab to your workbook.

__32. Save and Exit, and then run this new workbook. When prompted for a description, you can change the name of your new workbook from "Watson News Revised(1)" to "Watson News and Blogs". Click the Save button. Then click the Run button to run the workbook.

__33. Select the Workflow Diagram icon to see a mapping of the workbooks associated with the News and Blogs workbook. This can be done at any point to keep a clear picture of which workbooks you are extending to/from.

__34. Close this frame and continue to the next section.

Exploring Your Workbook

In this section, we will explore the data in your new workbook. You will perform actions like sorting, charting, cleansing, and grouping to make analysis of the HDFS data easier via BigSheets.

__35. If you are not already in the workbook, open the "Watson News and Blogs" workbook.

__36. In this case, we want to keep our initial workbook "as is" and produce another workbook that contains the records in sorted order. So, click the "Build New Workbook" button to do this.

__37. As a way to keep track of what you are doing, I typically suggest that you rename your new workbook right away, as this helps to remind you of what you are working on. So, provide a new name for your workbook, like "Watson Sorted".
__38. After looking at the data, you decide you would like to understand more about the language and the types of posts contained in your workbook.

__39. Click on the triangle next to the column name of any column, then click on the Sort -> Advanced option.

__40. Then, click on the pull-down triangle to expose the list of columns under the "Add Columns to Sort" area. Click on the green + button to add the two columns you wish to sort on. Then, select the desired order for sorting each column. In this case, your Advanced Sort should look like the following picture.

__41. Click on the green check-mark button to continue and create the new tab/sheet with your desired sorting applied to it.

__42. As with all new tabs/sheets, the system shows you a simulated result based on the rows of data BigSheets keeps in memory. You should be able to click on Fit column(s) to review the contents of both the Language and Type columns and see that your advanced sort was applied to this simulated set of data.
__43. Now, "Save and Exit" and then run your workbook. This will apply the sorting options to more than the first 2,000 rows the system operates on as a simulation; it will sort the entire, larger data set in the workbook. So, you should see different results once your workbook has been run. For example, in the simulated data, only one Vietnamese row was showing. However, against the entire data set, you should see twenty (20) rows that are of the Vietnamese language. This is because more of the Vietnamese rows were in the data beyond the first 2,000 rows the system uses in memory for a simulated result before you click the Run button. Review and confirm these results after the job reaches 100%, and then you can move on to the next step.

__44. To visualize this data better, you want to analyze the quantities of posts in the various languages in your current workbook, while still in the "Watson Sorted" workbook.

__45. Click on the "Add chart" link in the lower left (this may take a minute to populate the chart types the first time it is run), click on chart, then pie, and fill out the following information to produce a pie chart of the languages used.
__48. Click the Run button to run the visualization against the entire data set. __49. Once the chart has been run, you can interact with it to find out the second, most-popular language for posts regarding IBM Watson is Russian. Move your mouse over this item within your pie chart to see these results. __50. Additionally, you have decided you need to clean up some of the data within the workbook. Based on the fifth and sixth largest languages in the pie chart you just generated, you can see they are both variations on the Chinese language. __51. Let’s perform an action to clean up this data which will show all Chinese items in the same bucket. To do this, we will combine these two items into one item called Chinese 31
__52. From the "Watson Sorted" workbook, click on the edit button in order to go back into edit mode for this workbook.

__53. To make things easier, you can optionally click on the Fit column(s) button to make your columns thinner and to see more data on the screen.

__54. At this point, you have decided to add another column to your workbook. To do this, click on the triangle next to the Language column name. Select the Insert Right -> New Column option.

__55. Then, provide a name for your new column, like "Language_Revised", and click the green check-mark button (or hit Enter) to apply your new column name.

__56. Your cursor is then moved to the fx (or function) area, where you can provide the function to be used to generate the contents of your new column.

__57. Enter the following formula as your function:

IF(SEARCH('Chin*', #Language) > 0, 'Chinese', #Language)
This formula looks at the Language column, indicated by #Language. If the #Language column matches the pattern 'Chin*', then the new #Language_Revised column will contain 'Chinese'. If it does not, the value of #Language is copied over to #Language_Revised. (See the original article, URL at the top of this document, for additional explanation of this formula.)

__58. Clicking on the green check-mark button (or hitting Enter) should produce content in your new column based on your new formula.

__59. "Save and Exit", and you should notice that you receive a message to "Click run to update the data".

__60. Click the Run button in the upper right to run the workbook.

__61. Now, click on the Language Coverage tab that contains your previously generated pie chart. This now has the status of "needs to be run". Before we run it, we need to change one of the settings on the pie chart to use our newly generated column named Language_Revised.

__62. To change the settings, click on the triangle next to the Language Coverage tab.

__63. Click to select the "Chart Settings" option.

__64. Change the "Value:" item to be based on the new Language_Revised column.
__65. Click on the green check-mark button to apply your new settings.

__66. Click on the Run button to regenerate your pie chart.

__67. Once your new pie chart has been generated, you should be able to see Chinese as a single, cleaned-up item in your pie chart (compared to the two items you saw previously). With this cleansed data, Chinese is now the second largest and Russian is third.
Exercise 3: Querying data with IBM Big SQL

Learn how to use IBM Big SQL, an SQL language processor, to summarize, query, and analyze data in a data warehouse system for Hadoop.

Big SQL provides broad SQL support that is typical of commercial databases. You can issue queries using JDBC or ODBC drivers to access data that is stored in InfoSphere BigInsights, in the same way that you access databases from your enterprise applications. You can use the Big SQL server to execute standard SQL queries, and multiple queries can be executed concurrently. Big SQL supports large ad hoc queries by using MapReduce parallelism, as well as point queries, which are low-latency queries that return information quickly to reduce response time and provide improved access to data. The Big SQL server is multi-threaded, so scalability is limited only by the performance and number of CPUs on the computer that runs the server. If you want to issue larger queries, you can increase the hardware performance of the server computer that Big SQL runs on.

This tutorial uses data from the fictional Sample Outdoor Company, which sells and distributes outdoor products to third-party retailer stores around the world. For the last several years, the company has steadily grown into a worldwide operation, selling its line of products to retailers in nearly every part of the world.

The fictional Sample Outdoor Company began as a business-to-business operation. The company does not manufacture its own products; rather, the products are manufactured by a third party and then sold to third-party retailers. The company is expanding its business by creating a presence on the web, so it now also sells directly to consumers through its online store. You will learn more about the products and sales of the fictional Sample Outdoor Company by running Big SQL queries and analyzing the data in the following lessons.
In this lab, you will use the InfoSphere BigInsights Tools for Eclipse to create Big SQL queries so that you can extract large subsets of data for analysis. After you complete the lessons in this module, you will understand the concepts and know how to do the following actions:
• Connect to the Big SQL server by using the InfoSphere BigInsights Tools for Eclipse.
• Use the InfoSphere BigInsights Tools for Eclipse to load sample data and to create queries.
• Use BigSheets to analyze data generated from Big SQL queries.
• Use the InfoSphere BigInsights Console to run Big SQL queries.
• Use the InfoSphere BigInsights Tools for Eclipse to export data.

This module should take approximately 45 minutes to complete.

Connecting to the IBM Big SQL server

In this section, you view the JDBC connection, learn how to navigate to the Data Management view in Eclipse, and learn how to open the SQL editor. Big SQL is installed with InfoSphere BigInsights in the following directory: $BIGSQL_HOME/bin/bigsql.

Ensure that you complete the following actions before you begin the tutorial:

__1. Review the status of Big SQL. From the InfoSphere BigInsights Console, click the Cluster Status page. Then, verify that the Big SQL node is in a Running state on the Big SQL Server Summary page. If it is not started, click Start on the Big SQL Server Summary page.

__2. Start Eclipse and select the default workspace.
If you created an InfoSphere BigInsights server connection to a particular BigInsights server, the Big SQL JDBC driver connection and the Big SQL JDBC connection profile were both created for you automatically.

Ensure that a JDBC driver definition exists. Generally, either a JDBC or an ODBC driver definition must exist before you can create a connection to a Big SQL server. For the purposes of this tutorial, only a JDBC driver definition is required.

__3. From the Eclipse environment, click Window > Open Perspective > Other > Database Development. Ensure that the Data Source Explorer view is open.

__4. From the Data Source Explorer view, expand the Database Connections folder.

__5. Right-click the Big SQL connection and select Properties. Verify the connection details.

__6. Click Test Connection to make sure that you have a valid connection to the Big SQL server. If the ping succeeded, you may skip to the next section. If you do not have a connection, follow these steps:

__a. From the Data Source Explorer view, right-click the Database Connections directory and click Add Repository.
__b. In the New Connection Profile window, select Big SQL JDBC in the Connection Profile Types list. Edit the text in the Name field or the Description field, and click Next.

__c. In the Specify a Driver and Connection Details window, select IBM Big SQL JDBC Driver Default from the Drivers list. You can add or modify a driver by clicking the New Driver Definition icon or the Edit Driver Definition icon.

__d. The Connection URL field displays the URL that is generated. For example: jdbc:bigsql://bivm:7052/default

__e. Select the Connect when the wizard completes check box to connect to the Big SQL server after you finish defining this connection profile.

__f. Select the Connect every time the workbench is started check box to connect to the Big SQL server automatically by using this connection profile when you launch this Eclipse workspace.

__g. Click Test Connection to verify the connection. Click Finish to save.

Creating a project and an SQL script file

You can create Big SQL scripts and run them from the InfoSphere BigInsights Eclipse environment.

__1. Create an InfoSphere BigInsights project in Eclipse. From the Eclipse menu bar, click File > New > Other. Expand the BigInsights folder, select BigInsights Project, and then click Next.

__2. Type myBigSQL in the Project name field, and then click Finish.

__3. If you are not already in the BigInsights perspective, a Switch to the BigInsights perspective window opens. Click Yes to switch to the BigInsights perspective.

__4. Create an SQL script file. From the Eclipse menu bar, click File > New > Other. Expand the BigInsights folder, select SQL Script, and then click Next.

__5. In the New SQL File window, in the Enter or select the parent folder field, select myBigSQL. Your new SQL file is stored in this project folder.
__6. In the File name field, type aFirstFile. The .sql file extension is added automatically. Click Finish.

__7. In the Select Connection Profile window, select the Big SQL connection. The properties of the selected connection display in the Properties field. When you select the Big SQL connection, the Big SQL database-specific context assistant and syntax checks are activated in the editor that is used to edit your SQL file.

__8. Click Finish to close the Select Connection Profile window.

__9. In the SQL editor that opens the aFirstFile.sql file you created, add the following Big SQL comments:
-- This is a beginning SQL script
-- These are comments. Any line that begins with two
-- dashes is a comment line,
-- and is not part of the processed SQL statements.

Some lines in the file contain two dashes in front of text. Those dashes mark the text as comment lines. Comment lines are not part of the processed code. It is always useful to comment your Big SQL code as you write it, so that the reasons for using particular statements are clear to anyone who must use the file. You will be using this file in later lessons.

__10. Save aFirstFile.sql by using the keyboard shortcut Ctrl+S.

Creating tables and loading data

In this lesson, you will create the tables and load the data that you will use to run queries and create reports about the fictional Sample Outdoor Company. The examples in this tutorial use Big SQL tables and data that are managed by Hive.

This tutorial uses sample data that is provided in the $BIGSQL_HOME/samples folder on the local file system of the InfoSphere BigInsights server. By default, the $BIGSQL_HOME environment variable is set to the installed location, which is /opt/ibm/biginsights/bigsql/.

The time range of the fictional Sample Outdoor Company data is 3 years and 7 months, starting January 1, 2004 and ending July 31, 2007. The 43-month period reflects the history that is made available for analysis. The schema that is used in this tutorial is GOSALESDW. It contains fact tables for the following areas:
• Distribution
• Finance
• Geography
• Marketing
• Organization
• Personnel
• Products
• Retailers
• Sales
• Time

Access the sample SQL files
__1. Open a Terminal. It is located in the desktop folder BigInsights Shell.

__2. Change your directory to $BIGSQL_HOME/samples. Open and review the information in the README file.

__3. Open a new Terminal and type the setup command:

./setup.sh -u biadmin -p biadmin

__4. The script is running when you see this message:

Loading the data .. will take a few minutes to complete .. Please check /var/ibm/biginsights/bigsql/temp/GOSALESDW_LOAD.OUT file for progress information.

The script runs three files: GOSALESDW_drop.sql, GOSALESDW_ddl.sql, and GOSALESDW_load.sql.

__5. You can see the tables in the InfoSphere BigInsights Console Files tab: hdfs://biginsights/hive/warehouse/gosalesdw.db
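If you are curious what those three scripts do, each table is dropped, re-created, and then loaded from a delimited sample file. Below is a minimal sketch of the general shape of the DDL and load statements; the column list, delimiter, and file path shown here are assumptions for illustration only, and the authoritative definitions are in the GOSALESDW_ddl.sql and GOSALESDW_load.sql files themselves.

-- Illustrative sketch only; see GOSALESDW_ddl.sql and GOSALESDW_load.sql
-- under $BIGSQL_HOME/samples for the real definitions.
CREATE TABLE IF NOT EXISTS gosalesdw.go_region_dim
(
    region_key INTEGER NOT NULL,  -- hypothetical column, for illustration
    region_en  VARCHAR(90)        -- English region name (queried later in this lab)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'  -- delimiter is an assumption
STORED AS TEXTFILE;

-- File path below is an assumption; the load script supplies the real one.
LOAD HIVE DATA LOCAL INPATH
    '/opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
OVERWRITE INTO TABLE gosalesdw.go_region_dim;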
Running basic Big SQL queries

You now have tables with data that represents the fictional Sample Outdoor Company. In this lesson, you will explore some basic Big SQL queries and begin to understand the sample data. You will use the SQL file that you created earlier to test the tables and data that you created in the previous lesson. In this lesson, several sample SQL files that are included with the downloaded Big SQL samples can be used to examine the data and to learn how to manipulate Big SQL.

__1. From the Eclipse Project Explorer, open the myBigSQL project, and double-click the aFirstFile.sql file.

__2. In the SQL editor pane, type the following statement:

SELECT * FROM GOSALESDW.GO_REGION_DIM;

Each complete SQL statement must end with a semi-colon. The statement selects, or fetches, all the rows that exist in the GO_REGION_DIM table.

__3. Click the Run SQL icon.
Depending on how much data is in the table, a SELECT * statement might take some time to complete. Your result should contain 21 records, or rows.

You might have a script that contains several queries. When you want to run the entire script, click the Run SQL icon. When you want to run a specific statement, or set of statements, and you include the schema name with the table name (gosalesdw.go_region_dim), highlight the statement or statements that you want to run and press F5.

__4. Improve the SELECT * statement by adding a predicate to the second statement to return fewer rows. A predicate is a condition on a query that reduces and narrows the focus of the result. A predicate on a query with a multi-way join can improve the performance of the query.

SELECT * FROM GOSALESDW.GO_REGION_DIM
WHERE REGION_EN LIKE 'Amer%';

__5. Click Run SQL to run the entire script. This query results in four records, or rows. It might run more efficiently than the original statement because of the predicate.

Tip: If you get an error, type out the full query instead of using copy/paste.

__6. You can learn something more about the structure of the table with some queries to the SYSCAT catalog tables, which provide metadata support to the database. The Big SQL catalog consists of four tables in the schema SYSCAT: schemas, tables, columns, and index columns. Type the following query, and this time, select the statement and press F5.

SELECT * FROM syscat.columns
WHERE tablename= 'go_region_dim'
AND schemaname= 'gosalesdw';

This query uses two predicates in a WHERE clause. The query finds all of the information from the syscat.columns table as long as the tablename is 'go_region_dim' and the schemaname is 'gosalesdw'. Since you are using AND, both predicates must be true to return a row. Use single quotation marks around string values.

The result of the query to the syscat.columns table is the metadata, or the structure, of the table. Look at the SQL Results tab in Eclipse to see your output. The SQL Results tab shows 54 rows as your output. That means that there are 54 columns in the table go_region_dim.

__7. In addition to the query to learn about the structure of a table, you can also run a query that returns the number of rows in a table. Type the following query, select the statement, and then press F5.

SELECT COUNT(*) FROM gosalesdw.go_region_dim;

The COUNT aggregate function returns the number of rows in the table, or the number of rows that satisfy the WHERE clause in the SELECT statement, if a WHERE clause was part of the statement. The result is the number of rows in the set. A row that includes only null values is included in the count. In this example, there are 21 rows in the go_region_dim table.
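Because COUNT honors a WHERE clause, you can combine it with the predicate from step __4 to count only the matching rows. A minimal sketch follows; given the earlier result, this should return 4.

SELECT COUNT(*) FROM gosalesdw.go_region_dim
WHERE region_en LIKE 'Amer%';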
__8. Another form of the COUNT function is the COUNT(DISTINCT ...) statement. As the name implies, you can determine the number of unique values in a column, such as region_en. Type this statement in your SQL file:

SELECT COUNT (distinct region_en) FROM gosalesdw.go_region_dim;

The result is 5. This result means that there are five unique region names in English (the column name region_en).

__9. Another useful clause in Big SQL is the LIMIT clause. The LIMIT clause specifies a limit on the number of output rows that are produced by the SELECT statement for a given document. Type this statement in your SQL file, select the statement, and then press F5:

SELECT * FROM GOSALESDW.DIST_INVENTORY_FACT limit 50;

Without the LIMIT clause, the statement would return 53,837 rows (depending on the setting in the Eclipse preferences for the SQL Results view). If there are fewer rows than the LIMIT value, then all of the rows are returned. This statement is useful for seeing some output quickly.

__10. Save your SQL file.

Analyzing the data with Big SQL

In this lesson, you create and run Big SQL queries so that you can better understand the products and market trends of the fictional Sample Outdoor Company.

__1. Create an SQL file. From the Eclipse menu, click File > New > Other. In the Select a wizard window, expand the BigInsights folder, select SQL Script from the list of wizards, and then click Next. In the New SQL File window, select the myBigSQL project folder in the Enter or select the parent folder field. In the File name field, type companyInfo. The .sql file extension is added automatically. Click Finish.

To learn what products were ordered from the fictional Sample Outdoor Company, and by what method they were ordered, you must join information from multiple tables in the gosalesdw database, because it is a relational database where not everything is in one table.
__2. Type the following comments and statement into the companyInfo.sql file:

-- Fetch the product name and the quantity and
-- the order method.
-- Product name has a key that is part of other
-- tables that we can use as a join predicate.
-- The order method has a key that we can use
-- as another join predicate.
-- Query 1
SELECT pnumb.product_name, sales.quantity,
       meth.order_method_en
FROM gosalesdw.sls_sales_fact sales,
     gosalesdw.sls_product_dim prod,
     gosalesdw.sls_product_lookup pnumb,
     gosalesdw.sls_order_method_dim meth
WHERE pnumb.product_language='EN'
  AND sales.product_key=prod.product_key
  AND prod.product_number=pnumb.product_number
  AND meth.order_method_key=sales.order_method_key;

Because there is more than one table reference in the FROM clause, the query can join rows from those tables. A join predicate specifies a relationship between at least one column from each table to be joined.
• Predicates such as prod.product_number=pnumb.product_number help to narrow the results to product numbers that match in two tables.
• This query also uses aliases in the SELECT and FROM clauses, such as pnumb.product_name. pnumb is the alias for the gosalesdw.sls_product_lookup table. That alias can now be used in the WHERE clause so that you do not need to repeat the complete table name, and the WHERE clause is not ambiguous.
• The predicate pnumb.product_language='EN' helps to further narrow the result to only English output. This database contains thousands of rows of data in various languages, so restricting the language provides some optimization.
__3. Highlight the statement, beginning with the keyword SELECT and ending with the semi-colon, and press F5. Review the results in the SQL Results page. You can now begin to see what products are sold, and how they are ordered by customers.

__4. By default, the Eclipse SQL Results page limits the output to only 500 rows. You can change that value in the Data Management preferences. However, to find out how many rows the query actually returns in a full Big SQL environment, type the following query into the companyInfo.sql file, then select the query, and then press F5:

--Query 2
SELECT COUNT(*)
--(SELECT pnumb.product_name, sales.quantity,
-- meth.order_method_en
FROM gosalesdw.sls_sales_fact sales,
     gosalesdw.sls_product_dim prod,
     gosalesdw.sls_product_lookup pnumb,
     gosalesdw.sls_order_method_dim meth
WHERE pnumb.product_language='EN'
  AND sales.product_key=prod.product_key
  AND prod.product_number=pnumb.product_number
  AND meth.order_method_key=sales.order_method_key;
The result for the query is 446,023 rows.

__5. Update the query that is labeled --Query 1 to restrict the order method to equal only 'Sales visit'. Add the following string just before the semi-colon:

AND order_method_en='Sales visit'

__6. Select the entire --Query 1 statement, and press F5. The result of the query displays in the SQL Results page in Eclipse. The results now show the product and the quantity that is ordered by customers actually visiting a retail shop.
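For reference, after this change the full --Query 1 statement should read as follows; the only difference from the original is the final AND line:

-- Query 1, restricted to in-store orders
SELECT pnumb.product_name, sales.quantity,
       meth.order_method_en
FROM gosalesdw.sls_sales_fact sales,
     gosalesdw.sls_product_dim prod,
     gosalesdw.sls_product_lookup pnumb,
     gosalesdw.sls_order_method_dim meth
WHERE pnumb.product_language='EN'
  AND sales.product_key=prod.product_key
  AND prod.product_number=pnumb.product_number
  AND meth.order_method_key=sales.order_method_key
  AND order_method_en='Sales visit';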
__7. To find out which method of all the methods has the greatest quantity of orders, you must add a GROUP BY clause (group by pll.product_line_en, md.order_method_en). You will also use a SUM aggregate function (sum(sf.quantity)) to total the orders by product and method. In addition, you can clean up the output a bit by using an alias (as Product) to substitute a more readable column header.

-- Query 3
SELECT pll.product_line_en AS Product,
       md.order_method_en AS Order_method,
       sum(sf.QUANTITY) AS total
FROM gosalesdw.sls_order_method_dim AS md,
     gosalesdw.sls_product_dim AS pd,
     gosalesdw.sls_product_line_lookup AS pll,
     gosalesdw.sls_product_brand_lookup AS pbl,
     gosalesdw.sls_sales_fact AS sf
WHERE pd.product_key = sf.product_key
  AND md.order_method_key = sf.order_method_key
  AND pll.product_line_code = pd.product_line_code
  AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en;

__8. Select the complete statement, and press F5. Your results in the SQL Results page show 35 rows.

Exercise 4: Developing, publishing, and deploying your first big data app

In this lesson, you will learn how to publish an application that can run a simple Big SQL script. After you create scripts that contain Big SQL statements or commands, you can create applications that can run those Big SQL scripts with or without variables. For the purposes of this lesson, you are going to create a new project that contains one simple Big SQL script.

Developing your application

__1. Create a project to hold the script and the application information. In the Eclipse Project Explorer, click File > New > BigInsights Project. In the New BigInsights Project window, type MyTestApp in the Project name field. Click Finish.

__2. Verify that the sample data (IOD_sampleData/RDBMS_data.csv) is located in your desktop folder called biadmin's Home.
Create a Big SQL script within the MyTestApp project that creates a table if it does not already exist, and then loads data into the table from a local file.

__3. Right-click the MyTestApp project and select New > SQL Script. In the File name field, type LoadLocal.sql, and select the BigSQL connection. Click Finish.

__4. Type or copy the following code into the LoadLocal.sql file:

create table if not exists media_csv
(id integer not null,
 name varchar(50),
 url varchar(50),
 contactdate string)
row format delimited fields terminated by ','
stored as textfile;

load hive data local inpath '/home/biadmin/IOD_sampleData/RDBMS_data.csv'
-- overwrite
into table media_csv;

Be sure to modify the inpath parameter value to match the path where you extracted the files.

__5. Save the LoadLocal.sql file.

Publishing your application

__1. In the Eclipse Project Explorer, right-click the MyTestApp project, and select BigInsights Application Publish.
From the BigInsights Application Publish window, complete the wizard with the following information:

__2. Location page - Verify the server name and click Next. If you do not have a valid connection to a Big SQL server, you can create one by clicking Create.

__3. Application page - Specify Create New Application.

__a. Type an application name in the Name field, such as myBigSQLApp.
__b. You can type some text in the Description field. This text is optional, but it can help others understand how to run your application when it is deployed on the server.

__c. Optionally, select an icon that contains a visual identifier for this application file from the local file path by clicking Browse. You can use a default icon that is provided by the server.

__d. List one or more categories, such as load, that you want the application to be listed under. This category is a helpful tag when you search for new applications to deploy.

__e. Click Next.

__4. Type page - Select Workflow as the Application Type, and click Next.

__5. Workflow page

__a. Specify Create a new single action workflow.xml file.

__b. Select Big SQL from the Action Type drop-down list.

__c. In the Properties pane, list the properties that you need.
__i. Specify the name of the script that contains your Big SQL statements so that you can run that script from the InfoSphere BigInsights server. Select script from the list. Enter "LoadLocal.sql" in the Value field. The Variable field is false. Next, specify the credentials properties file for connecting to Big SQL:

__ii. Click New.

__iii. In the New Property window, in the Name field, select the property credentials_prop from the list.

__iv. In the Value field, specify the name of your properties file in the BigInsights credentials store. Type /user/biadmin/credstore/private/bigsql.properties

__d. Parameters page - If you have parameters that you want to add, click New. In this case, click Next. Note: The simple script that you are using in this lesson does not need parameters.

__6. Publish page - You see the structure of your application in the preview of the BIApp.zip file. You can click Add to add other scripts or supporting information. Click Finish.

Deploy and run your application
The application that you just published from Eclipse is now in the InfoSphere BigInsights Console.

__1. Click the Application tab of the InfoSphere BigInsights Console, and click Manage to see the new application.

__2. Find the myBigSQLApp application by the application title.

__3. Highlight the application and click Deploy.

__4. In the Deploy Application window, click Deploy.

__5. Click Run in the Applications tab. You see myBigSQLApp in the list of deployed applications. Double-click the application.
__6. In the application, type a name in the Execution Name field, such as Test1, to identify a particular runtime instance. Since everything in this application was hardcoded, you can now click Run.

__7. If the application runs successfully, you see a green checked box. If the application failed, you see a red box. You can examine the log by selecting the arrow in the Details column of the Application History pane.

Examine the output of your application:

__8. Click on the Files tab of the InfoSphere BigInsights Console. Navigate to the directory of the Hive warehouse (such as /biginsights/hive/warehouse).

__9. You can see that a new folder that represents the media_csv table was created. Expand the folder to see the RDBMS_data.csv file.

__10. Click the file to display its contents.

__11. You can also issue a select statement from your Eclipse project, or from the Big SQL Console of InfoSphere BigInsights, to see the contents in Big SQL:

Select * from media_csv;

You have created an application that creates a table with columns, if that table does not already exist. The application then loads data into the table from a local file. If you run the application again, it loads a duplicate set of the data. You can fill the file with new data and run the application again, and it loads that new data. You can quickly see that this kind of exercise can be useful if you need to see refreshed data regularly. By adding scheduling information to the application, you can have the application run daily or weekly to add or refresh the data. By including variables in the application, you can run the application with data from your dynamic input. You can also uncomment the overwrite clause so that each run replaces the table contents with the current data instead of appending a duplicate set.
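For example, with the comment marker removed, the load statement in LoadLocal.sql becomes the following (same sample path as before); each run then replaces the contents of media_csv rather than appending to it:

load hive data local inpath '/home/biadmin/IOD_sampleData/RDBMS_data.csv'
overwrite
into table media_csv;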