Data Parser for Unstructured Text - Tutorial 3 - Columnar Data

Tutorial 3 guides you through the steps to create and save a script file in Data Parser for Unstructured Text that reads and flattens a report containing columnar data.

In Tutorial 3, you convert the data in a columnar report file, to a flattened format, using the more automatic features of Data Parser.

This tutorial introduces more of the time-saving features of Extract Schema Designer. Since a great many report formats contain columnar data of some kind, it is highly useful to anyone who wants to use Data Parser.

By following the steps outlined below, you become familiar with both the process of creating an extract script and the terms used throughout the documentation.

Unlike the previous tutorials, this file has multiple Accept Record line styles in a single page of the report. The primary data record information is in the table detail lines. Each line is essentially a record. Each of these is an Accept Record line.

Tutorial Goals

In this tutorial, you will learn:

  • How to create a script that reads and flattens a report with columnar data
  • How to use more automatic features of Data Parser
  • How to save the script file

Procedure

The following steps should be completed in the order shown:

Define Line Styles and Data Fields for Detail Lines

After selecting the tutorial file and setting up basic options, define line styles and data fields. Data Parser does the following when you complete this task:

  • Divides the line into seven Data Fields using spaces as a column separator. The Data Fields are given default field names SALES/ MARKETING_1 through SALES/MARKETING_7.
  • Creates a Line Style for the line. The Line Style that is automatically created has a default Line Name of SALESMARKETING. It identifies all lines in the report that have the string SALES/MARKETING in positions 1 through 16.
To define the line styles and data fields for detail lines:
  1. Select the first detail line (it begins with SALES/MARKETING) by clicking in the Line Style column (the narrow yellow stripe on the left) immediately to the left of the line.
  2. This highlights the entire line of text.
  3. Right-click in the Line Style Column (the yellow part of the screen on the left) and select Parse Columnar Data.
  4. From the menu, select Preferences and click once on Close Definition Dialogs on Add/Update to disable the option.
  5. To view the definitions of the Data Fields created, double-click on the colored sections of the line.
  6. For example, double-clicking on the green numbers 75,249 in the Data Panel brings up the SALES/MARKETING_2 Data Field in the Field Definition window. SALES/MARKETING_2 is the default Data Field name given to the second Data Field in the SALESMARKETING line. It starts in position 20 and ends in position 27. Since it is defined for the Line Style SALESMARKETING, only lines that match that recognition pattern contain this Data Field in positions 20 through 27.
  7. Proceed to Change Data Field Names.

Change Data Field Names

The Browse Data Record uses the Data Field names as column headings for the Data Fields, so it is a good idea to change the Data Field names for SALES/MARKETING_1 through SALES/MARKETING_7 to more descriptive field names.

To change Data Field names:
  1. Double-click on one of the Data Fields in the SALESMARKETING line to open the Field Definition window.
  2. In the Field Definition window, highlight the default Field Name and replace it with a corresponding descriptive name.
  3. See table 3-3 below.
  4. Click Update.
  5. To select the next Data Field, click the Field Name arrow to display a drop-down list of Data Fields that have been defined for the current Line Style.
  6. Select the next Data Field and continue until you have renamed all the fields. Close the Field Definition window when finished.
  7. Proceed to Change Line Style Name and Definition.
Table 3-3: Tutorial 3 - Suggested Data Field Names
Default Name Suggested Name
SALES/MARKETING_1 Department
SALES/MARKETING_2 Team1
SALES/MARKETING_3 Team2
SALES/MARKETING_4 Team3
SALES/MARKETING_5 Team4
SALES/MARKETING_6 Team5
SALES/MARKETING_7 DepartmentTotal

Change Line Style Name and Definition

To view the new Line Style SALESMARKETING, double-click on the name SALESMARKETING in the Line Style column (the yellow column on the left of your screen). The Line Style Definition window appears.

Notice the SALESMARKETING Line Style is recognized by a pattern where columns 1 to 16 contain the string SALES/MARKETING.

To change the Line Style Name and Line Action:
  1. In the Line Style Definition window, change the Line Style Name by highlighting SALESMARKETING in the Line Style Name box and replacing it with Detail.
  2. Also in the Line Style Definition window, click the Line Action tab and select the ACCEPT Record Including option.
  3. Click Update.
  4. Proceed to Define Line Recognition Rules.

Define Line Recognition Rules

The Detail Line Style only matches lines that have SALES/MARKETING in columns 2 through 16. That is the recognition pattern that Data Parser created automatically, but it is not the pattern that is needed in this case. The pattern needs to be general enough to match all of the detail lines in the text, but specific enough to match ONLY the detail lines. Update the Line Pattern so that the Line Style match all of the detail lines excluding the TEAM TOTALS line.

Analyze the detail lines to find what makes them unique in comparison to other lines in the text. Things to look for are position of the Data Fields, contents of the Data Fields, anything that is consistent for each of the detail lines but not contained in non-detail lines. For example, the detail lines contain:

  • Commas in positions 24, 34 and 75 on every line
  • Only letters, white space, and a "/" in columns 2 through 18
  • Only digits, white space, and commas in columns 20 through 79
  • A digit in position 78
  • An upper case letter in each of the first 5 positions

Of all of the above observations, creating a pattern to look for uppercase letters in the first five positions is the best way to go. Here are some reasons why:

  • Defining a pattern that would check for commas in positions 24, 34, and 75 requires three pattern lines and probably would not match every detail line in subsequent reports.
  • Suppose in this same report (created a week later) Team 2 of the Development department went to a pre-paid weeklong class and they only spent 100 dollars on supplies for the class. This means that a comma would not be in position 34 of that detail line so it would not match the Line Style, and the essential data on that line would be lost.
  • Defining a pattern to check for letters, white space, and a "/" in columns 2 through 18 would require three pattern lines and would also match the column heading line.
  • Defining a pattern to match lines that contains at least one digit in positions 20 through 79 and do not contain letters or "/" would require three pattern lines and it would match the detail lines. However, it also matches the Team Totals line.
  • Defining a pattern to match lines that contain a digit in position 79 would match the detail lines and the Team Totals line.
To define a pattern that looks for upper case letters in positions 2 through 6:
  1. Click the Line Recognition Rules tab in the Line Style Definition window.
  2. Click once in the Look For? cell in the first row of the grid and click the down arrow.
  3. The Pattern Builder window appears.
  4. In the Pattern Builder window, click in the Type cell and click the down arrow to display the allowable values for the Type field.
  5. Select character class from the list.
  6. This tells Extract Schema Designer what kind of data it needs to match for that line style to be valid.
  7. Tab to the Value cell and click the arrow to display the allowable values for the Value field.
  8. Select upper case letters from the list.
  9. This tells Data Parser the specific data it needs to match for that line style to be valid.
  10. Change the value in the Count cell to 5 by highlighting the value there and typing a 5.
  11. Change the value of the Begin field to 2 and the End field to 6.
  12. This tells Data Parser where to look for the data you specified and how many of that particular data must be found for the line style to match that line.
  13. Click OK.
  14. Click Update to save the modified line style definition.
  15. Proceed to Define Data Fields.

Define Data Fields

In this part of the exercise, you will define the rest of the data in the record, starting with the report title.

To define the ReportTitle Data Field:
  1. Select the report title ABC CORPORATION BUDGET on line 1 by highlighting it in the Data Panel.
  2. Right-click in the Data Panel and select Define Data Field > New Data Field.
  3. The Field Definition window appears.
  4. Change the default name to ReportTitle.
  5. Click Add.
  6. Data Parser takes the selected text and Data Field name to automatically define a Data Field named ReportTitle and a line style as well, named ABC_CORPORATION_BUDG.
  7. Click Close.
  8. Double-click on ABC_CORPORATION_BUDG in the Line Style Column to display the Line Style Definition window.
  9. Notice that Data Parser automatically creates a recognition pattern that looks for the literal ABC CORPORATION BUDGET in positions 27 through 48.
  10. In the LineStyleName field, type ReportTitle.
  11. Click Update and Close.
  12. Proceed to Define Line Styles.

Define Line Styles

  1. Select the report date 10/26/95 on line 2 by highlighting the text with the mouse.
  2. Right-click in the Data Panel and select Define Data Field > New Data Field.
  3. The Field Definition window appears.
  4. Change the default name to ReportDate.
  5. Click Add and then Close.
  6. Data Parser takes the selected text and enters Data Field name and automatically define a Data Field and a Line Style.
  7. Double-click Style1 in the Line Style Column to display the Line Style Definition window.
  8. Notice that Data Parser automatically created a recognition pattern that looks for the literal "/" in positions 35 and 38.
  9. Rename the Line Style to ReportDate.
  10. Click Update and Close.
  11. Browse the data Records to see how your data has changed.
  12. If desired, rearrange the data fields as needed to meet your export file layout requirements.
  13. Save and close your script.