Data Parser for Unstructured Text - Tutorial 5 - Columnar Data with a Footer
Tutorial 5 guides you through the steps to create and save a script file in Data Parser for Unstructured Text that reads and flattens a data file containing both detail lines and a footer line with data to extract.
This tutorial is useful to anyone likely to be working with columnar data with footer. Before doing this tutorial, it is recommended that you do Data Parser Tutorial 3 - Columnar Data first.
By following the steps outlined below, you become familiar with both the process of creating an extract script and the terms used throughout the documentation.
Tutorial Goals
In this tutorial, you will learn:
- How to create a script that reads and flattens a data file with detail and footer lines
- How to save the script file
- New terms located throughout the documentation
Procedure
The steps in this tutorial should be completed in the order shown.
The primary data record information is in the table detail lines. This data is highly structured in neat consistent columns. Data Parser can build recognition patterns for Line Styles and Data Fields with this type of data automatically, saving you a lot of time and effort.
Define Line Styles and Data Fields
After selecting the tutorial file and setting up basic options, define line styles and data fields.Data Parser automatically creates a Line Style for the line and gives it a default Line Name of SALESMARKETING when you complete this task. Data Parser also automatically parses the line into 7 Data Fields using spaces as a column separator. The Data Fields are given default names of SALESMARKETING_1 through SALESMARKETING_7.
To define line styles and data fields:
- Select the first detail line (it begins with SALES/MARKETING) by clicking in the Line Style column immediately to the left of the line to highlight the entire line of text.
- Right-click in the Line Style Column (the yellow stripe on the left part of the screen), and select Parse Columnar Data.
- Proceed to Change Data Field Names.
Change Data Field Names
Since the Data Field names are used in the Browse Data Record as column headings for the Data Fields, change the Data Field names for SALES/MARKETING_1 through SALES/MARKETING_7 to more descriptive field names.
See the new, and more descriptive, names for the Data Fields in Table 3-4 below.
Table 3-4 Tutorial 5 - Suggested Data Field Names
| Default Name | Suggested Name |
|---|---|
| SALES/MARKETING_1 | Department |
| SALES/MARKETING_2 | Team1 |
| SALES/MARKETING_3 | Team2 |
| SALES/MARKETING_4 | Team3 |
| SALES/MARKETING_5 | Team4 |
| SALES/MARKETING_6 | Team5 |
| SALES/MARKETING_7 | DepartmentTotal |
To change Data Field names:
- In the Preferences menu, disable Close Definition Dialogs on Add/Update by unchecking it.
- Double-click on one of the Data Fields in the SALESMARKETING line to open the Field Definition window.
- In the Field Definition window, select the default field name, highlight it, and replace it with the corresponding descriptive name given above.
- Click Update.
- To select the next Data Field, click the Field Name arrow and a list of Data Fields that have been defined for the current Line Style is displayed. Select the next Data Field.
- Name the remaining Data Fields until you have named all the fields.
- Click Close.
- Double-click on the name, SALESMARKETING, in the Line Style column on the left of the screen. The Line Style Definition window appears.
- Type in a new name, Detail.
- Click Update.
- Proceed to Define Recognition Patterns.
Define Recognition Patterns
The SALESMARKETING Line Style is recognized by a pattern where columns 2 to 16 contain the text SALES/MARKETING. This pattern matches only the first detail line. It needs be general enough to match all of the detail lines in the text, but specific enough to match only the detail lines, not the TEAM TOTALS line.
Analyze the detail lines to find what makes them unique in comparison to other lines in the text. Things to look for are position of the Data Fields, contents of the Data Fields, anything that is consistent for each of the detail lines but not contained in non-detail lines. For example, the detail lines contain:
- Commas in positions 24, 34 and 75 on every line.
- Only letters, white space, and a / in columns 2 through 18.
- Only digits, white space, and commas in columns 20 through 79.
- A digit in position 78.
- An upper case letter in each of the first 5 positions.
Of all of the above observations, creating a pattern to look for uppercase letters in the first 5 positions is the best way to go. Here are some reasons why:
- Defining a pattern that checks for commas in positions 24, 34, and 75 would require 3 pattern lines and probably would not match every detail line in subsequent reports. Suppose in this same report (created a week later) Team 2 of the Development department went to a pre-paid weeklong class and they only spent 100 dollars on supplies for the class. This means that a comma would not be in position 34 of that detail line so it would not match the Line Style.
- Defining a pattern to check for letters, white space, and a / in columns 2 through 18 would require three pattern lines and also matches the column heading line.
- Defining a pattern to match lines that contains at least one digit in positions 20 through 79 and does not contain letters or / would require 3 pattern lines. It also matches the Team Totals line.
- Defining a pattern to match lines that contain a digit in position 79 would match the detail lines and the Team Totals line.
So, the best pattern to use is one that looks for capital letters in columns 2 through 6.
To define a recognition pattern:
- In the Line Style Definition window, click once in the Look For? cell in the first row of the grid, then click the arrow to display the Pattern Builder window.
- Change the value of the Type field from literal to character class by clicking in the Type cell, then clicking the arrow to display the allowable values for the Type field. Select character class from the list.
- Click in the Value cell, then click the arrow to display the allowable values for the Value field. Select upper case letters from the list.
- Highlight the value in the Count field and change it to 5. The value of the Begin field should be 2. Change the value of the End field to 6 and click OK.
- Click in the empty cell in the seecond row of the And/Or column. The string And automatically displays in that cell.
- Click in the first empty cell in the Search What? Column. Then click on the down arrow and select Column Range (m-n) from the list.
- Click in the first empty cell in the Operator column. Then click on the down arrow and select Does Not Contain from the list.
- Click in the first empty cell in the Look For? column. Then click on the down arrow. This opens the Pattern Builder window.
- If the Value column is not empty, delete the contents of that cell. Press
or place the mouse pointer in the cell in the Value column and click once to position the blinking cursor in that cell. Type a capital P. - Change the values in the Count, Begin and End cells to 1.
- Click OK in the Pattern Builder window.
- Click the Update and Close in the Line Style Definition window. Notice that Detail now appears beside each of the detail lines in the Line Style column, and not next to the Processing Date line.
- Proceed to Modify Data Fields.
Modify Data Fields
To modify the Data Fields in the Detail lines so the Data Parser can extract the data on the last line of the report:
- Double-click in the Department Data Field, the red text at the beginning of each detail line. The Field Definition window opens.
- Click on the Data Collection/Output tab, and click on the Array Field option to enable it.
- Click Update.
- Click the Data Field Name arrow and choose the next Data Field.
- Repeat this process for each field in any one line of text.
- Click Close.
- Proceed to Define Line Style.
Define Line Style
To define the PROCESSING DATE line:
- Highlight PROCESSING DATE:.
- Right-click in the Data Panel and select Define Line Style > Auto New Line Style > Accept Record. Data Parser defines the Line Style using the first word of the highlighted text in that position as the recognition pattern and named the Line Style PROCESSING.
- Proceed to Define Data Field.
Define Data Field
To define the Data Field on the PROCESSING_DATE line:
- Highlight the date from columns 17 through 24.
- Right-click in the Data Panel and select Define Data Field > New Data Field.
- When the Field Definition window opens, change the default Field Name to Date.
- Click Add in the Field Definition window.
- Browse the data Records to see how your data has changed.
- Rearrange the data fields as needed to meet the requirements of your export data file.
- Save and close your script.
Your transaction secured by high-grade AES-256 encryption.