Pervasive Data Extractors Online Help - Table of Contents
Pervasive DataTools
Data Extractor User’s Guide
Pervasive Software Inc.
12365 Riata Trace Parkway
Building B
Austin, TX 78727 USA
Web: http://pervasivedatatools.com
About This Manual
This manual is currently a work in progress and therefore is incomplete. Documentation for the Data Extractor is also available by clicking on the Help button on the right end of the button bar whenever the product is running.
This manual leads you through the operation of the Data Extractor user interface. The Pervasive Data Extractor allows you to extract useful data from report files and convert that data to a CSV Text file. You must have a non-expired Data Extractor license to run this application.
Refer to the license.txt file in the default installation directory for disclaimers and information about trademarks and credits.
Table of Contents
Getting Started with Data Extractor
Tutorials
Using Data Extractor
All About Line Styles
All About Data Fields
Viewing the Extracted Data
Exporting the Extracted Data
Saving and Reusing Extract Scripts
Reference - User Interface
- Tool Bar Buttons
- Main Menus
- Extract Manager Window
- Extract Script Designer
- Source Options Window
- Debug Extract Design Window
- ACCEPT Record Definition Window
- Accept Record Reorder
- Record Browser Window
- Multi-Record Browser
- Pattern Builder Window
- Line Order for Extract Window
- All Fields Window
- Edit Fields Window
- Export Field Order Window
- Find Text
- Pop-up Menu - Line Style Column
- Pop-up Menu - Data Panel
- Line Style Definition Window
- Data Field Definition Window
Appendix
Introduction to Data Extractor
The Data Extractor is a software product with the ability to read complex text files of many kinds. The amount of computer data grows vastly each year, and much of it is provided in raw text formats. Some examples of the many sources handled by the Data Extractor follow:
- Printouts from programs captured as disk files
- Reports of any size or dimension
- ASCII or any type of EBCDIC text files
- Spooled print files
- Fixed length sequential files
- Complex multi-line files
- Downloaded text files (e.g., news retrieval, financial, real estate)
- HTML and other structured documents
- Internet text downloads
- Email header and body
- Online textual databases
- CD-ROM textbases
- Files with tagged data fields
- XML
- HL7
- Swift
- And many others...
Using Data Extractor, you can extract the desired data fields from various lines in the text file, and assemble those fields into a flat record of data. Thus, whole records of structured data can be extracted and presented in a conventional tabular (row and column) format that is needed before mapping and converting the data to a popular target format. Some of the features that make the Data Extractor so complete are:
- No practical limits on file size
- Reads almost any kind of report architecture as long as there are rules
- Support for large fields and records
- Handles floating headers, footers and details
- Can automatically detect and propose recognition patterns
- Handles tagged data fields
- Autoparses columnar and tagged data
- Powerful debugging tools
- Structured data browser to see results prior to export
- Built on an extensible, extremely rich scripting language
The extraction of desired fields from the source text file is accomplished by visually marking up the file in the Data Extractor user interface. The mouse is employed to select the desired fields from various lines displayed on the screen. Dialog boxes on the screen allow you to express a rich set of pattern recognition rules and actions to assist in the extraction of clean data.
Several techniques are available to view samples of extracted data. Apart from scrolling the full text of the data, a debug window can be used to search for all lines satisfying certain extraction criteria. For details, see **Debug Extract Design Window**. In addition, users can pop up a data browser that assembles all the fields and records in a grid format to give the user an idea of how the data will export. For details, see **Record Browser Window**.
Data Extraction Basics
The Data Extractor is a tool for extracting data that would otherwise be inaccessible.
Consider these scenarios:
- Your company is attempting to migrate several years’ worth of data from a legacy application. The data files for this application are stored in an unknown proprietary format, possibly with compressed or encrypted fields. Although the data cannot be accessed directly, your legacy application can generate reports.
- Your agency needs to merge data from several disparate sources into a single, easily accessible format. For example, you receive listings of real-estate properties from several different electronic sources that you want to combine into one standard listing format for your web site.
- One of your clients needs to extract specific data from many large log files and aggregate that data into a database for statistical analysis.
In each of these scenarios, the Data Extractor can extract valuable data from standard formatted text files that contain a lot of irrelevant information, such as headers and comments.
The Data Extractor exports the extracted data to CSV (Comma Separated Values) Text file format. If you want to convert the data to another format, or you want to manipulate the data further after you have extracted it, the Data Loaders can accomplish this. The Data Loaders support over 100 different file types, allowing you to convert your data to the vast majority of databases used throughout the world.
To Use Data Extractor
First you need to have a report file. Most applications on nearly every type of platform give you the option of creating and printing reports. Have the program print the report in a text only format, either ASCII or any standard EBCDIC code page. For more information, see How to Create a Report File.
- Start a new script in the Data Extractor and select the report file.
- Look at the report in the Data Extractor.
Notice the overall pattern of the report: where it repeats, the page layout, and the style used to organize information. Locate the data that you want to extract.
- Input the structural information. The Data Extractor needs patterns and structural rules to identify important data.
- Define line styles by marking which lines have important information and how they can be recognized.
- Define data fields by marking the data that you want collected, and where it can be found.
- Specify line actions.
While you are defining line styles and data fields, select options that specify how you want the data to be assembled into records and fields. The default action is to collect the fields. You must find the end of the first record, or the beginning of the second, and change the action for that line to Accept Record. This stops the collection process for the first record and begins the collection process for the second, thus setting exactly which fields are included in the eventual output for that record.
If you want to define more than one type of record in a single report file, you can do that by defining more than one Accept Record line style.
- Assign the fields to each record type, according to how you want the data to be exported.
- Browse your data.
Once you have entered all the information Data Extractor needs to find your data, and specified how you want it structured, the Data Extractor automatically builds that structure internally. You can open the data browser and see it in a grid. If the fields or records are not structured the way you want them, go back and adjust the data field and/or line style definitions.
- Finally, save the script.
By saving your script, you can use it again if you need to extract data from a report with the same style in the future.
Additional details about each of these steps are described in this documentation.
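The collect-and-accept model described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Data Extractor's own scripting language: the function `extract_records`, the tuple layout of a line style, and the sample report lines are all hypothetical. Each line style has a recognition rule, a line action, and a set of fields given as 1-based inclusive column ranges. In this sketch the accepting line's own fields are included in the record it closes.

```python
def extract_records(lines, styles):
    """Assemble flat records from report lines.

    styles is a list of (name, matches, action, fields) tuples, where
    fields maps a field name to 1-based inclusive (start, end) columns.
    This layout is an assumption made for illustration only.
    """
    records = []
    current = {}
    for line in lines:
        for name, matches, action, fields in styles:
            if not matches(line):
                continue
            # COLLECT: copy each field defined on this line style.
            for field, (start, end) in fields.items():
                current[field] = line[start - 1:end].strip()
            # ACCEPT: close the current record and start the next one.
            if action == "ACCEPT":
                records.append(current)
                current = {}
            break  # first matching line style wins in this sketch
    return records


# Made-up report: two records, each ended by a "City:" line.
report = [
    "Name:  Alice",
    "City:  Austin",
    "-----",
    "Name:  Bob",
    "City:  Dallas",
    "-----",
]
styles = [
    ("Name", lambda s: s.startswith("Name:"), "COLLECT", {"Name": (8, 20)}),
    ("City", lambda s: s.startswith("City:"), "ACCEPT", {"City": (8, 20)}),
]
records = extract_records(report, styles)
```

Lines that match no line style, such as the dashed separators, are simply ignored, which mirrors how irrelevant report lines are skipped.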
Feature Segmentation
The following list presents some of the features available in the Data Extractor:
- EBCDIC code page translation
- Recognizes special characters and invisible characters
- Multiple record Accepts
- Mailing Label Template autoparse
- Validates scripts automatically
- View source data in external applications
- Grid lines on Data Panel
- Extract/Mine data from irregular text files
- Auto New Line Style menu option
- Auto New Data Field menu option
Some additional automatic menu options:
- Parse Columnar Data
- Parse Columnar Data w/ Heading
- Parse on Field Separator
- Parse Tagged Data
- Parse Standards Data
- Parse XML/HTML Data
- Parse HL7 Data
- Parse Swift Data
- Parse LDIF Data
- Parse EDI Data
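As a rough illustration of what EBCDIC code page translation involves, the following Python sketch decodes EBCDIC bytes into readable text using the standard library codecs. It is not how the Data Extractor performs the translation internally. The helper name `decode_ebcdic`, the choice of code page `cp037`, and the sample text are assumptions for this example.

```python
def decode_ebcdic(raw: bytes, code_page: str = "cp037") -> str:
    """Translate EBCDIC bytes to text; cp037 is one common EBCDIC code page."""
    return raw.decode(code_page)


# Build a made-up EBCDIC sample by encoding ordinary text.
sample = "Problem No: 1001".encode("cp037")
text = decode_ebcdic(sample)
```

The encoded bytes differ from their ASCII counterparts, which is why the correct code page must be chosen before any patterns can be recognized in the file.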
About the Tutorials
Seven step-by-step tutorials are available to help you learn how to use the Data Extractor. We recommend that you complete the tutorials in the order in which they appear, as each tutorial builds upon the concepts covered in the previous tutorial.
Common Tasks
The Data Extractor tutorials all have several tasks in common. Those tasks are described here, and you may refer to them as needed.
- Select Correct Tutorial File and Set Basic Options
- Browse Data Records
- Rearrange Data Fields
- Save and Close Extract Design
Select Correct Tutorial File and Set Basic Options
Before starting each tutorial, you must select the matching tutorial file and set some basic options and file properties.
Use the following Data Extractor tutorial files with the tutorials:
- Tutorial 1 - TUTOR1.REP
- Tutorial 2 - TUTOR1.REP
- Tutorial 3 - TUTOR3.REP
- Tutorial 4 - TUTOR4.REP
- Tutorial 5 - TUTOR5.REP
- Tutorial 6 - TUTOR6.REP
- Tutorial 7 - TUTOR7.TXT
To select a tutorial file and set basic options do the following:
- In the Data Extractor, click New Extract.
- In the Select the Text File window, navigate to the desired tutorial file in your default installation directory (Common800).
- Click Open. The report opens in the Data Extractor Data Panel.
- Open the Source Options window.
- In the Source Options window, select options that match the type and format of your text/report file.
- Close the Source Options window.
- Select Preferences from the menu and make sure "Close Definition Dialogs on Add/Update" is enabled.
Browse Data Records
Browse the data when you want to determine how your design choices have affected the data.
To browse the data records:
- Click Browse Data Record in the toolbar.
If there is only one Accept record, a message window appears saying something similar to "Fields Assigned to Accept Record Category".
If there are multiple Accept Records, you will be prompted to assign specific data fields to each Accept Record.
- Click OK.
All of the Data Fields appear in the Data Browser window for you to preview your data in a tabular (row and column) format and to verify you have defined everything correctly. If you wish, you may rearrange the data fields.
Rearrange Data Fields
If the fields in the Data Browser window are not in the order you want them to appear in the export data file, change the export field order.
To rearrange data fields:
- Select Field > Export Field Layout from the menu.
- Click and drag a field name to the desired position. A special symbol displays while you are dragging.
- Reopen the Data Browser window to view your export fields in the order they will appear in the export data file.
- Once you are satisfied with the appearance of the data, save and close your extract script design.
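The effect of changing the export field order can be pictured with a short Python sketch that writes the same records to CSV text in a chosen column order. This is an illustration only: the function `export_csv` is hypothetical, the record values are made up, and writing a header row is an assumption rather than documented Data Extractor behavior.

```python
import csv
import io


def export_csv(records, field_order):
    """Write records as CSV text with columns in the given order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=field_order,
        extrasaction="ignore",  # skip fields not listed in field_order
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up record using field names from Tutorial 1.
records = [{"Problem_No": "1001", "Techie": "John", "Status": "Open"}]
output = export_csv(records, ["Status", "Problem_No", "Techie"])
```

Only the column order changes; the extracted values themselves are untouched.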
Save and Close Extract Design
After you have completed your Extractor script, save and close it for later use.
To save your script and close Data Extractor:
- Save your script by clicking Save Extract in the toolbar. The Save Extract window appears.
New extract scripts that have not been previously saved in Data Extractor display as "Extract: Extract1" in the title bar.
Note: The extract file name defaults to your workspace; however, the extension has been changed to .cxl, which is the standard extension for Data Extractor scripts. You can change the Extract File Name, but the extension must remain .cxl.
- Navigate to your default installation directory (Common 800) and name the extract script file (for example, Tutor1.cxl).
- Enter a description of the tutorial, if desired, and click OK.
- Click the Close Extract icon in the toolbar.
- Exit Data Extractor by selecting File > Exit.
The Data Extractor Tutorials
The following is a list and brief description of each of the Data Extractor tutorials.
Data Extractor Tutorial 1 - The Basics
Tutorial 1 guides you through the basic steps to create and save a script file in Data Extractor. Later tutorials are more detailed.
This tutorial presents the fundamental concepts for using the Data Extractor. It is recommended that you do this tutorial first. The example file is a tagged list, but the procedure is useful regardless of the type of report. The best way to use this tutorial is to print a hard copy so you can follow the sequential steps.
Tutorial Goals
In this tutorial, you will learn:
- The basic process of creating an extract script
- How to save the script design
- New terms located throughout the documentation
Procedure
This tutorial is divided into three sections that should be completed in the order shown.
Define the Line Style - Accept Record
After selecting the tutorial file and setting up basic options, the first step in defining many extract scripts is to determine the line of data that marks the end of a record. In this case, the line with the string "Category:" is the last line of the first record.
After you identify the end of the record, define the line style for that line by marking the information that makes that line unique. In every record in this data file, the last line contains the string "Category:".
- Highlight the string "Category:" (including the colon following it).
- Right-click anywhere in the Data Panel (the large white area of the screen) and select Define Line Style > New Line Style.
The Line Style Definition window appears.
Notice Data Extractor has already formed line recognition rules based on the information you highlighted. It searches for all lines that contain the string "Category:" in columns 15 through 23.
- To indicate that Data Extractor should accept the record at this point, ending one record and beginning the next, click the Line Action tab.
- Select ACCEPT Record.
- Click Add and proceed to Define the Line Style - Collect Fields.
The line style name, Category, now appears in the Line Style Column (the yellow column on the left of your screen) to mark that line as matching the Category: Line Style pattern. A bold green arrow displays designating that this is the Accept Record line. Scroll down in the data panel and notice that each line that matches the pattern you defined was automatically marked with the "Category" Line Style.
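A line recognition rule of this kind (a string that must appear within a column range) can be sketched in Python as follows. The function name `matches_column_range` and the sample lines are hypothetical; the sketch only illustrates the idea of testing columns 15 through 23 for "Category:", using 1-based inclusive column numbers as shown in the user interface.

```python
def matches_column_range(line: str, text: str, begin: int, end: int) -> bool:
    """Return True if text occurs within 1-based inclusive columns begin..end."""
    return text in line[begin - 1:end]


# Made-up lines: the tag starts in column 15 in the first, column 1 in the second.
tagged_line = " " * 14 + "Category: Conversion"
shifted_line = "Category: Conversion"

is_category = matches_column_range(tagged_line, "Category:", 15, 23)
is_shifted = matches_column_range(shifted_line, "Category:", 15, 23)
```

Here `is_category` is true and `is_shifted` is false, because the same text outside the expected columns does not satisfy the rule.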
Define the Line Style - Collect Fields
In the TUTOR1 file, the first line of text that contains pertinent data is the line with the report date "13-Jul-95" (10th line). The dashes (and their positions) in this line make it unique and are likely to remain consistent even if the date changes in later reports.
- Highlight the first dash.
- Right-click in the Data Panel and select Define Line Style > New Line Style.
The Line Style Definition window appears. Notice that a pattern was created based on what you highlighted. Data Extractor looks for any line that contains a dash in column 13.
- Type a more descriptive Line Style Name, such as "Report_Date".
- Click the Line Action tab and leave the option set to COLLECT Field Contents.
The COLLECT Field Contents option causes any fields defined on this line to be included in the final output.
COLLECT Field Contents is the action you want for the majority of the lines in this type of report.
- Click Add.
- Locate and highlight the string "Problem No:".
- Right-click in the Data Panel and select Define Line Style > New Line Style.
The Line Style Definition window appears. Notice that Data Extractor generated a Line Style recognition pattern based on the highlighted string "Problem No:". Data Extractor also used the string "Problem_No:" to name the Line Style. You may rename the Line Style if you wish.
Data Extractor automatically selects COLLECT Field Contents as the line action. Since this is the option you want on most of the lines in this report, accept the default.
- Type a line style name.
- Click Add.
- Repeat steps 6 through 9 for each of the remaining lines in the first record.
Remember, the "Category" Line Style has already been defined, and it is the Accept Record line.
- Proceed to Define Data Fields.
Define Data Fields
After defining line styles for 14 lines of the first record, define the Data Fields. You have given Data Extractor the pattern information it needs to identify the lines in the report. Now define which part of each line you consider to be useful data.
- Locate the line containing the date of the report.
The line shows only the report date, so all of the text on that line is important.
- Highlight the entire date.
The highlighted text is 1 row by 9 columns. The column and row numbers show at the bottom right part of the screen. Columns 11 through 19 on row 10 contain the date.
- Right-click in the Data Panel and select Define Data Field > New Data Field. The Field Definition window appears.
Notice the Field Definition option is set to Fixed Column in both the Start Rule and End Rule tabs. The Data Field starts in column 11 and ends in column 19, exactly where you highlighted.
- The Field Name defaults to "Report_Date_1" indicating that this is the first field on the Report_Date line. Change the default name to a more descriptive name, Report_Date, by typing it in the Field Name box.
- Click Add.
- Define the remaining Data Fields:
- On the Problem_No line, highlight from column 25 to 30.
This grabs enough space to include any larger numbers that might occur in later records.
- Right-click in the Data Panel and select Define Data Field > New Data Field.
- The default Field Name is Problem_No_1. Problem_No is a descriptive name, but there is only one field on this line so the "_1" is unnecessary.
- Click in the Field Name box and backspace twice to delete the number and underscore.
Notice the Field Definition defaults to Fixed Column in both the Start Rule and End Rule tabs, starting in column 25 and ending in column 30.
- Click Add.
- Repeat step 6 for each remaining line of text on page 1 in TUTOR1.REP containing tagged data. See Table 3-2 below.
- Proceed to Browse Data Record in order to see how your data has changed.
- Rearrange Data Fields as needed.
- Save and Close your script.
Table 3-2: Tutorial 1 - Data Field Start and End Rules
| Data Field | Starting Column | Ending Column |
|---|---|---|
| Report_Date | 11 | 19 |
| Problem_No | 25 | 30 |
| Techie | 25 | 52 |
| Status | 25 | 52 |
| MMDDYY | 25 | 32 |
| Time | 25 | 32 |
| Serial_No | 25 | 39 |
| Version | 25 | 52 |
| Customer_Name | 25 | 52 |
| Company_Name | 25 | 52 |
| Phone_No | 25 | 52 |
| Source_Type | 25 | 52 |
| Target_Type | 25 | 52 |
| Category | 25 | 52 |
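The Fixed Column start and end rules in Table 3-2 amount to slicing each recognized line between two column numbers. The Python sketch below illustrates this for two of the fields. The helper `slice_field` and the sample lines are hypothetical (the problem number 1001 is made up), and columns are treated as 1-based and inclusive, matching the indicator in the user interface.

```python
# Start and end columns taken from Table 3-2 (1-based, inclusive).
FIELD_COLUMNS = {
    "Report_Date": (11, 19),
    "Problem_No": (25, 30),
}


def slice_field(line: str, start: int, end: int) -> str:
    """Return the text in columns start..end with surrounding spaces removed."""
    return line[start - 1:end].strip()


# Made-up lines laid out to match the tutorial's column positions.
date_line = " " * 10 + "13-Jul-95"
problem_line = " " * 12 + "Problem No: 1001"

report_date = slice_field(date_line, *FIELD_COLUMNS["Report_Date"])
problem_no = slice_field(problem_line, *FIELD_COLUMNS["Problem_No"])
```

Because the Problem_No range is wider than the current value, a longer number in a later record would still fit, which is the reason the tutorial highlights through column 30.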
Data Extractor Tutorial 2 - Tagged Data and Automatic Features
Tutorial 2 guides you through the steps to create and save a script file using Data Extractor’s automatic processes. The source file for this tutorial is the same tagged-list used in Data Extractor Tutorial 1.
This tutorial introduces some of the useful timesaving features of Data Extractor that read and flatten a data file that contains tagged data. It is useful to anyone ready to learn about more advanced Data Extractor features. Tutorial 2 examines some quicker, more automatic ways to parse the same tagged-data used in Data Extractor Tutorial 1.
Things to remember when defining Data Fields and Line Styles in tagged data:
- When Data Extractor automatically creates Data Fields, it uses the positions you have highlighted to determine the length of the Data Field. Be sure to allocate enough space for data in subsequent records that is wider than the text you are currently selecting. For example, the Techie Name in the first record is "John". In a subsequent record it could be "Alexander Graham Bell IV".
- For tagged data, everything in the selection to the left of the Tag Separator is the Field Tag and everything to the right of the Tag Separator is the Data Field.
- When a Line Style is created, it is not just for the line you are working on but also for any line that matches the Line Style definition. This means that when you create a Line Style that looks for "Techie:" in columns 17 to 24, and there is a Data Field defined for that Line Style in columns 26 to 55, all lines that have "Techie:" in columns 17 to 24 have a Data Field in columns 26 to 55.
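The tag/separator rule described above can be illustrated with a small Python sketch. The function `parse_tagged` is hypothetical and the colon separator is an assumption based on the tutorial file; the sketch splits a selection at the first separator only, so data that itself contains a colon (such as a time value) stays intact. The time value shown is made up.

```python
def parse_tagged(selection: str, separator: str = ":"):
    """Split a selection into (field tag, data) at the first separator."""
    tag, _, data = selection.partition(separator)
    return tag.strip(), data.strip()


techie = parse_tagged("Techie: John")
time_field = parse_tagged("Time: 10:30")
```

Splitting on the first separator rather than every separator is a design choice of this sketch; it keeps everything to the right of the tag together as one Data Field.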
Tutorial Goals
In this tutorial, you will learn:
- How to create an extract script using automatic processes
- How to save the extract design as a script file
- New terms used throughout the Data Extractor documentation
Procedure
These steps should be completed in the order shown.
Define Data Fields
After selecting the tutorial file and setting up basic options, the first step in defining most extract scripts is to determine the line of data that marks the end of a record. In the TUTOR1 data file, the line of text that contains "Category:" marks the end of each record.
- Highlight the line that contains the string "Category:", up to column 45. Check the indicator in the lower right corner of the screen for column locations.
- Right-click anywhere in the Data Panel (the large white area of the screen) and select Define Data Field > Parse Tagged Data.
Note: Data Extractor automatically defines a Line Style with the string "Category:" in columns 15 through 23 as the recognition pattern, and a Line Action of Collect Fields, and names it "Category". It also creates a Data Field that collects any data on that line beginning at column 24, one space after the colon, and going to column 45, and names it "Category". The field is now defined, and the text turns red on the screen.
- If you wish to check the Data Field definition, you can double-click on the field itself (the red text) and the Field Definition window opens. Make any necessary changes, then click Update.
- Proceed to Define the Line Style - Accept Record.
Define the Line Style - Accept Record
Since the Category Line Style is the last line of the record, the Line Action should be Accept Record. When Data Extractor creates a line style automatically, it makes the line style Collect Fields, so the line action needs to be changed.
- Double-click on the Line Style Name, "Category" in this case, in the Line Style Column, the yellow column on the left part of your screen. The Line Style Definition window appears.
- Click on the Line Action tab and select ACCEPT Record [Including] This Line’s Fields from the list of choices.
- Click Update.
- View the data record by clicking on the Browse Data Record button in the button bar.
- Proceed to Adjust Data Field Definition.
Adjust Data Field Definition
- Select the entire Problem No line by left clicking on that line in the Line Style Column (the left yellow column).
- Right-click in the Data Panel (the large white part on the right) and select Define Data Field > Parse Tagged Data.
The Line Style pattern that Data Extractor automatically creates looks for Problem No: in positions 13 through 23.
- Double-click on Problem_No if you want to check it.
- Click Close to close the Line Style Definition window.
- To display the Field Definition window to view the information for the Problem_No: Data Field that was automatically generated, double-click anywhere in the Data Field where the text is red.
- Click the End Rule tab. Notice that the end rule is 52.
This is larger than the Problem No: Data Field needs to be, because it is defining the size of the Data Field all the way to the right margin of the report.
- Change the end rule of the Problem_No: field to 30.
- Click Update.
Notice the selected area on the Data panel for the Problem_No: Data Field is much smaller after the update.
- Proceed to Define the Header Information.
Define the Header Information
For this exercise, assume that the first line of the report contains information you want.
- Highlight the report name WINTECH on line 8 in positions 11 through 17.
- Right-click in the Data panel, and select Define Data Field > New Data Field.
The Field Definition window appears.
- The default Data Field Name is highlighted. Since there is no tag on this line, Data Extractor used the data itself as the Line Style name and Data Field name. Change the field name to ReportName by typing it in the Field Name box.
- Click Add.
- To define the report date Data Field, repeat steps 1 through 4, except highlight from columns 11 to 19 and name the field ReportDate.
- Proceed to Update Line Style.
Update Line Style
The purpose of this exercise is to update the automatically generated "Jul95" Line Style to make it more generic for different report dates.
- To edit the "Jul95" Line Style, double-click on Jul95 in the Line Style Column.
The Line Style Definition window displays. Notice that the Pattern for this Line Style looks for 13-Jul-95 in columns 11 to 19.
- Size the cells in the grid to view the information better, by following these steps:
- Position the mouse over the line in the header row of the grid where the column headings are. The mouse pointer becomes a bold vertical bar with arrows pointing to the left and right.
- Hold down the mouse button and drag the edge of the column to the left or right.
- Release the mouse button when the column is the desired size.
- If desired, adjust the height the same way using the gray border to the left where the triangle and asterisk are located.
- To change the pattern to look for a line with any date with the dd-mmm-yy format, click once in the Look For? cell on the first row of the grid where 13-Jul-95 is currently displayed.
A down arrow appears on the right side of that cell.
- Click on that arrow and the Pattern Builder window appears.
- TAB to the Value cell, delete the original value, and type a dash (-).
- Change the values of both the Begin and End cells to 13 by tabbing to them and typing in the correct number.
- Click OK.
Notice that the Look For?, Begin, and End values have changed in the Line Style Definition window to reflect the changes made in the Pattern Builder window.
- Add a new row to the Line Style Definition grid by clicking in the And/Or cell in the second row. Accept the value default of And.
- Click in the Search What? cell of the second row and click the down arrow.
- Select Column Range (m-n) from the displayed list.
- Select Contains from the list displayed in the Operator cell of the second row.
- Click on the arrow in the Look For? cell of the second row to display the Pattern Builder window again.
- TAB to the Value cell, delete the original value, and type in a dash (-).
- Change the Begin and End values to 17.
Be careful to only enter a dash in the Value cell and do not leave any spaces around it.
- Click OK.
The line style definition should now match any line with a dash in both position 13 and position 17.
- Click Update to save the changes to the ReportDate Line Style.
- Proceed to Define Remaining Data Fields and Line Styles.
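The two-row pattern built in this exercise (a dash in column 13 And a dash in column 17) can be pictured with the following Python sketch. The names `RULES` and `matches_all` are hypothetical, and the second date is made up; the sketch only illustrates how combining Contains rules with And makes the line style independent of the actual date.

```python
# Each rule: (value to look for, begin column, end column), 1-based inclusive.
RULES = [("-", 13, 13), ("-", 17, 17)]


def matches_all(line: str, rules) -> bool:
    """Return True only if every rule's value occurs in its column range (And)."""
    return all(value in line[begin - 1:end] for value, begin, end in rules)


original_date = " " * 10 + "13-Jul-95"
later_date = " " * 10 + "02-Feb-96"    # made-up date in the same dd-mmm-yy layout
report_name = " " * 10 + "WINTECH"

date_matches = matches_all(original_date, RULES)
later_matches = matches_all(later_date, RULES)
name_matches = matches_all(report_name, RULES)
```

Both date lines match while the report name line does not, which is the behavior the generalized line style is meant to achieve.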
Define Remaining Data Fields and Line Styles
In this exercise, you will Define Data Fields and Line Styles for the Techie, Status, MM/DD/YY, Time, Ser #, Version, Customer Name, Company Name, Phone #, Source Type, and Target Type Tagged Data Fields.
- Highlight the Field Tag, the Tag Separator, and the data by dragging the mouse with the left mouse button depressed from the beginning of the Tag to the end of the Data Field.
Remember to extend out to the right to catch wider data in subsequent records.
- Right-click in the Data Panel and select Define Data Field > Parse Tagged Data.
Data Extractor creates a Line Style Definition and a Data Field Definition for you.
OR
- Click the line in the Line Style column to select it.
- Select Parse Tagged Data.
- Open the Field Definition window.
- Adjust settings.
- Click Update and Close.
Note: Data Extractor named the MM/DD/YY, Ser #, and Phone # Data Fields and corresponding Line Styles MMDDYY, Ser, and Phone. Also, Data Fields with embedded spaces are named with the spaces removed. This was done because Field Names can only contain letters, digits and underscores. Scroll down in the Data panel and see how the rest of the data is being defined.
- Browse the data records to see how your data has changed.
- If desired, rearrange the data fields as needed to meet your export file requirements.
- Save and close your script.
Tip: This file can be parsed even more automatically. If you wish to try it, follow these steps:
- Click the Clear Line Styles icon in the button bar.
- Highlight all the tagged data lines in the entire first record, beginning with the Problem No line and highlighting all the way down and including the Category line.
Be sure to catch all the field tags and data plus some extra space to the right.
- Right-click in the data panel and select Define Data Field > Parse Tagged Data.
The Data Extractor creates several new line styles and data fields at once. This method works only with highly structured and consistent data, but it can be a great time saver when conditions are ideal.
Data Extractor Tutorial 3 - Columnar Data
Tutorial 3 guides you through the steps to create and save a script file in Data Extractor that reads and flattens a report containing columnar data.
In Tutorial 3, you convert the data in a columnar report file to a flattened format using the more automatic features of Data Extractor.
This tutorial introduces more of the time-saving features of Extract Schema Designer. Since a great many report formats contain columnar data of some kind, it is highly useful to anyone who wants to use Data Extractor.
By following the steps outlined below, you become familiar with both the process of creating an extract script and the terms used throughout the documentation.
Unlike the previous tutorials, this file has multiple Accept Record line styles in a single page of the report. The primary data record information is in the table detail lines. Each line is essentially a record. Each of these is an Accept Record line.
Tutorial Goals
In this tutorial, you will learn:
- How to create a script that reads and flattens a report with columnar data
- How to use more automatic features of Data Extractor
- How to save the script file
Procedure
The following steps should be completed in the order shown:
Define Line Styles and Data Fields for Detail Lines
After selecting the tutorial file and setting up basic options, define line styles and data fields. Data Extractor does the following when you complete this task:
- Divides the line into seven Data Fields using spaces as a column separator. The Data Fields are given default field names SALES/MARKETING_1 through SALES/MARKETING_7.
- Creates a Line Style for the line. The Line Style that is automatically created has a default Line Name of SALESMARKETING. It identifies all lines in the report that have the string SALES/MARKETING in positions 1 through 16.
To define the line styles and data fields for detail lines:
- Select the first detail line (it begins with SALES/MARKETING) by clicking in the Line Style column (the narrow yellow stripe on the left) immediately to the left of the line.
This highlights the entire line of text.
- Right-click in the Line Style Column (the yellow part of the screen on the left) and select Parse Columnar Data.
- From the menu, select Preferences and click once on Close Definition Dialogs on Add/Update to disable the option.
- To view the definitions of the Data Fields created, double-click on the colored sections of the line.
For example, double-clicking on the green numbers 75,249 in the Data Panel brings up the SALES/MARKETING_2 Data Field in the Field Definition window. SALES/MARKETING_2 is the default Data Field name given to the second Data Field in the SALESMARKETING line.
It starts in position 20 and ends in position 27. Since it is defined for the Line Style SALESMARKETING, only lines that match that recognition pattern contain this Data Field in positions 20 through 27.
- Proceed to Change Data Field Names.
Change Data Field Names
The Browse Data Record uses the Data Field names as column headings for the Data Fields, so it is a good idea to change the Data Field names for SALES/MARKETING_1 through SALES/MARKETING_7 to more descriptive field names.
To change Data Field names:
- Double-click on one of the Data Fields in the SALESMARKETING line to open the Field Definition window.
- In the Field Definition window, highlight the default Field Name and replace it with a corresponding descriptive name.
See Table 3-3 below.
- Click Update.
- To select the next Data Field, click the Field Name arrow to display a drop-down list of Data Fields that have been defined for the current Line Style.
- Select the next Data Field and continue until you have renamed all the fields. Close the Field Definition window when finished.
- Proceed to Change Line Style Name and Definition.
Table 3-3: Tutorial 3 - Suggested Data Field Names
| Default Name | Suggested Name |
| SALES/MARKETING_1 | Department |
| SALES/MARKETING_2 | Team1 |
| SALES/MARKETING_3 | Team2 |
| SALES/MARKETING_4 | Team3 |
| SALES/MARKETING_5 | Team4 |
| SALES/MARKETING_6 | Team5 |
| SALES/MARKETING_7 | DepartmentTotal |
Change Line Style Name and Definition
To view the new Line Style SALESMARKETING, double-click on the name SALESMARKETING in the Line Style column (the yellow column on the left of your screen). The Line Style Definition window appears.
Notice the SALESMARKETING Line Style is recognized by a pattern where columns 2 to 16 contain the string SALES/MARKETING.
To change the Line Style Name and Line Action:
- In the Line Style Definition window, change the Line Style Name by highlighting SALESMARKETING in the Line Style Name box and replacing it with Detail.
- Also in the Line Style Definition window, click the Line Action tab and select the ACCEPT Record Including option.
- Click Update.
- Proceed to Define Line Recognition Rules.
Define Line Recognition Rules
The Detail Line Style only matches lines that have SALES/MARKETING in columns 2 through 16. That is the recognition pattern that Data Extractor created automatically, but it is not the pattern that is needed in this case. The pattern needs to be general enough to match all of the detail lines in the text, but specific enough to match ONLY the detail lines. Update the Line Pattern so that the Line Style matches all of the detail lines excluding the TEAM TOTALS line.
Analyze the detail lines to find what makes them unique in comparison to other lines in the text. Things to look for are position of the Data Fields, contents of the Data Fields, anything that is consistent for each of the detail lines but not contained in non-detail lines. For example, the detail lines contain:
- Commas in positions 24, 34 and 75 on every line
- Only letters, white space, and a "/" in columns 2 through 18
- Only digits, white space, and commas in columns 20 through 79
- A digit in position 78
- An upper case letter in each of the first five character positions (columns 2 through 6)
Of all of the above observations, creating a pattern to look for upper case letters in those five positions is the best way to go. Here are some reasons why:
- Defining a pattern that would check for commas in positions 24, 34, and 75 requires three pattern lines and probably would not match every detail line in subsequent reports.
Suppose in this same report (created a week later) Team 2 of the Development department went to a pre-paid weeklong class and they only spent 100 dollars on supplies for the class. This means that a comma would not be in position 34 of that detail line so it would not match the Line Style, and the essential data on that line would be lost.
- Defining a pattern to check for letters, white space, and a "/" in columns 2 through 18 would require three pattern lines and would also match the column heading line.
- Defining a pattern to match lines that contain at least one digit in positions 20 through 79 and do not contain letters or "/" would require three pattern lines and it would match the detail lines. However, it also matches the Team Totals line.
- Defining a pattern to match lines that contain a digit in position 78 would match the detail lines and the Team Totals line.
To define a pattern that looks for upper case letters in positions 2 through 6:
- Click the Line Recognition Rules tab in the Line Style Definition window.
- Click once in the Look For? cell in the first row of the grid and click the down arrow.
The Pattern Builder window appears.
- In the Pattern Builder window, click in the Type cell and click the down arrow to display the allowable values for the Type field.
- Select character class from the list.
This tells Extract Schema Designer what kind of data it needs to match for that line style to be valid.
- Tab to the Value cell and click the arrow to display the allowable values for the Value field.
- Select upper case letters from the list.
This tells Data Extractor the specific data it needs to match for that line style to be valid.
- Change the value in the Count cell to 5 by highlighting the value there and typing a 5.
- Change the value of the Begin field to 2 and the End field to 6.
This tells Data Extractor where to look for the data you specified and how many of that particular data must be found for the line style to match that line.
- Click OK.
- Click Update to save the modified line style definition.
- Proceed to Define Data Fields.
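The recognition rule built above can be pictured with a short Python sketch. This is only an analogy for the rule "character class, upper case letters, Count 5, Begin 2, End 6"; the function name `matches_detail` and the sample lines are hypothetical, and the sketch assumes the TEAM TOTALS line starts in column 2 like the detail lines.

```python
import re

def matches_detail(line, begin=2, end=6, count=5):
    """True if columns begin..end (1-based, inclusive) hold at least `count` upper case letters."""
    segment = line[begin - 1:end]
    return len(re.findall(r"[A-Z]", segment)) >= count

# Hypothetical report lines, each starting with one leading space.
print(matches_detail(" SALES/MARKETING     75,249"))   # detail line
print(matches_detail(" DEVELOPMENT            100"))   # detail line without a comma
print(matches_detail(" TEAM TOTALS        200,000"))   # "TEAM " has only four letters
```

Because the rule looks only at the department name, a detail line whose amounts contain no commas still matches, while TEAM TOTALS fails because its fifth position in the range is a space.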
Define Data Fields
In this part of the exercise, you will define the rest of the data in the record, starting with the report title.
To define the ReportTitle Data Field:
- Select the report title ABC CORPORATION BUDGET on line 1 by highlighting it in the Data Panel.
- Right-click in the Data Panel and select Define Data Field > New Data Field.
The Field Definition window appears.
- Change the default name to ReportTitle.
- Click Add.
Data Extractor takes the selected text and Data Field name to automatically define a Data Field named ReportTitle and a line style as well, named ABC_CORPORATION_BUDG.
- Click Close.
- Double-click on ABC_CORPORATION_BUDG in the Line Style Column to display the Line Style Definition window.
Notice that Data Extractor automatically creates a recognition pattern that looks for the literal ABC CORPORATION BUDGET in positions 27 through 48.
- In the LineStyleName field, type ReportTitle.
- Click Update and Close.
- Proceed to Define Line Styles.
Define Line Styles
- Select the report date 10/26/95 on line 2 by highlighting the text with the mouse.
- Right-click in the Data Panel and select Define Data Field > New Data Field.
The Field Definition window appears.
- Change the default name to ReportDate.
- Click Add and then Close.
Data Extractor uses the selected text and the Data Field name you entered to automatically define a Data Field and a Line Style.
- Double-click Style1 in the Line Style Column to display the Line Style Definition window.
Notice that Data Extractor automatically created a recognition pattern that looks for the literal "/" in positions 35 and 38.
- Rename the Line Style to ReportDate.
- Click Update and Close.
- Browse the data Records to see how your data has changed.
- If desired, rearrange the data fields as needed to meet your export file layout requirements.
- Save and close your script.
Data Extractor Tutorial 4 - Floating Tags
Tutorial 4 guides you through the steps to create and save a script in Data Extractor that reads and flattens a data file that contains floating tag data in a variable-length ASCII report.
This tutorial is useful to anyone likely to be working with floating tag data. By following the steps outlined below, you become familiar with both the process of creating an extract script and the terms used throughout the documentation.
Tutorial Goals
In this tutorial, you will learn:
- How to create a script that reads and flattens an ASCII report with floating tag data
- How to save the script file
- New terms located throughout the documentation
Procedure
The steps in this tutorial should be completed in the order shown.
Define Line Styles
After selecting the tutorial file and setting up basic options, find the patterns in this file and build recognition patterns (Line Style Definitions) so that Data Extractor can identify the lines with data.
The first characteristic of this report to notice is that each data record uses two lines of text. Another important characteristic is that several characters on each line are repeated consistently in the same position. These consistent patterns make it easy for you to build Line Style Definitions.
Data Extractor automatically creates a Line Style using ATTDOC in columns 19 through 24 as the Recognition Pattern and ATTDOC as the Line Style Name when you complete this task. Each line of text that matches this Line Style now displays the Line Style Name ATTDOC and a green arrow in the Line Style Column to the left of the text line.
To define Line Styles:
- Highlight the letters TRN in columns 15 through 17.
The letters TRN are in the same position in the first line of every record in the report. We could use the slash ( / ) in the third column or the colon ( : ) in the ninth column or any of several other consistent characters to identify the line, but the TRN is fine. Data Extractor needs only one consistent characteristic to identify a line.
- Right-click in the Data Panel and select Define Line Style > Auto New Line Style > Action - Collect Fields.
- In the second line of text, highlight the string ATTDOC in columns 19 through 24.
These six letters appear in the same position in each second line of every record in the report.
- Right-click in the Data Panel, and select Define Line Style > Auto New Line Style > Action-Accept Record since this is the last line of every record.
- Proceed to Define Data Fields.
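The difference between the Collect Fields and Accept Record actions can be sketched in Python as follows. This is a simplified analogy, not the product's implementation: the function names and sample lines are hypothetical, and the sketch clears the collected values after each accepted record, which may differ from how Data Extractor handles field contents between records.

```python
def line_style(line):
    """Identify a line by a literal at a fixed position (1-based columns)."""
    if line[14:17] == "TRN":        # columns 15 through 17
        return "TRN"                # action: Collect Fields
    if line[18:24] == "ATTDOC":     # columns 19 through 24
        return "ATTDOC"             # action: Accept Record
    return None

def build_records(lines):
    """Collect data from TRN lines and emit one record at each ATTDOC line."""
    records, current = [], {}
    for line in lines:
        style = line_style(line)
        if style == "TRN":
            current["TRN"] = line.rstrip()
        elif style == "ATTDOC":
            current["ATTDOC"] = line.strip()
            records.append(current)   # the record is complete on this line
            current = {}
    return records

# Hypothetical two-line records preceded by a heading line that matches nothing.
report = [
    "DAILY LOG",
    "12/01 08:15AM TRN 000000012345 SMITH, JOHN A TIM: 12/01/95 08:10",
    " " * 18 + "ATTDOC NO: 123456 ATTDOC: JONES BY: ADM",
    "12/01 09:30AM TRN 000000012346 LEE, ANN TIM: 12/01/95 09:22",
    " " * 18 + "ATTDOC NO: 654321 ATTDOC: PATEL BY: ADM",
]
print(len(build_records(report)))
```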
Define Data Fields
Notice that only the first couple of Data Fields in each of the TRN lines fall within the same columns from record to record. Define these fields first:
- In the TRN line, highlight the logged date and time data from columns 1 through 13.
- Right-click in the Data Panel, and select Define Data Field > New Data Field.
- In the Field Definition window, overwrite the default by typing Log in the Field Name box.
- Click Add.
- Highlight the string TRN from columns 15 through 17.
- Right-click and select Define Data Field > New Data Field.
The Field Definition window appears.
- Overwrite the default field name by typing Trans_Type in the box.
- Click Add.
The TRN text changes to green all through the report indicating that it is the second data field defined on that line.
- Highlight the 12-digit number from columns 19 through 30.
- Right-click and select Define Data Field > New Data Field.
The Field Definition window appears.
- In the Field Definition window, type Trans_No in the Field Name box.
- Click Add.
The numeric text changes to blue in each of the TRN lines within the report indicating that it is the third field defined on that line.
- Proceed to Change Vertical Positioning.
Change Vertical Positioning
Because the patient and doctor names vary in length, you cannot use Fixed Position to define the remainder of the Data Fields on the TRN lines. However, every field other than the names has a field label ending in a colon and a space (a Field Tag), so you can define those fields as Floating Tag. "Floating" means that the Field Tags are not in the same position on the line in every record. If there were no Field Tags, you could still define the fields using Relative Word Position.
The fourth field starts in the same column in each of the TRN lines so you can define Start Rule as Fixed Position for this field.
To change the Vertical Positioning Bar:
- Click Vertical Positioning Bar.
- Click at the beginning of the field to confirm that the field does indeed start in the same position in every record.
- Click Vertical Positioning Bar again to remove the red line.
The End Rule is Floating Tag because TIM:, the tag for the next field, always occurs at the end of this field.
- Proceed to Set Floating Tags - First Line of Text.
Set Floating Tags - First Line of Text
- Highlight the patient’s name from columns 32 through 47.
- Right-click and select Define Data Field > New Data Field.
The Field Definition window appears.
- Type Patient in the Field Name box.
- Click the End Rule tab.
- Click on the Floating Tag option.
Notice that the cursor is now blinking in the box to the right of the option.
- Type TIM: in the box.
This tells Data Extractor that this Data Field ends when the TIM: Field Tag is encountered.
- To prevent truncation, click the End Rule tab and set the Default FldLength to 30 bytes.
- Click Add.
Notice that the patient’s name does not change to colored text in the report. Fields defined as Floating Tag or Relative Word Position do not appear in colored text, nor are they underlined even if you have Underline Fields enabled in the Preferences menu. This is because those field positions are not the same in all records.
- Highlight the date and time data from columns 54 through 71.
- Right-click and select Define Data Field > New Data Field.
- Type Date_Time in the Field Name box.
- At the Start Rule tab select the Floating Tag option.
- Type TIM: in the box.
This tells Data Extractor that this Data Field starts immediately after the string TIM:.
- Click the End Rule tab and select the Floating Tag option.
- Type TYP: in the box.
This tells Data Extractor that this Data Field ends when the TYP: Field Tag is encountered.
- Click Add.
- Repeat the task for all except the last field (RATE).
- Proceed to Set End of Line - First Line of Text.
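The Start Rule and End Rule combinations used in this tutorial can be approximated in Python. The sketch below is an analogy only: `between_tags` is a hypothetical helper, the sample lines are made up, and the real product applies additional settings such as Default FldLength that are not modeled here.

```python
def between_tags(line, start_tag=None, end_tag=None, start_col=None):
    """Return the text after start_tag (or from 1-based start_col) up to end_tag (or end of line)."""
    if start_tag is not None:
        pos = line.find(start_tag)
        if pos < 0:
            return None                     # the tag is missing from this line
        begin = pos + len(start_tag)
    else:
        begin = (start_col or 1) - 1        # Fixed Position start
    stop = len(line)                        # End of Line unless an end tag is found
    if end_tag is not None:
        found = line.find(end_tag, begin)
        if found >= 0:
            stop = found
    return line[begin:stop].strip()

# Hypothetical TRN lines; the names differ in length, so the tags "float".
prefix = "12/01 08:15AM TRN " + f"{12345:012d}" + " "   # name starts in column 32
rec1 = prefix + "SMITH, JOHN A TIM: 12/01/95 08:10 TYP: A RATE: 45.00"
rec2 = prefix + "MONTGOMERY, ELIZABETH TIM: 12/01/95 09:22 TYP: B RATE: 120.50"

print(between_tags(rec1, start_col=32, end_tag="TIM:"))   # fixed start, floating end
print(between_tags(rec2, "TIM:", "TYP:"))                 # floating start and end
print(between_tags(rec2, "RATE:"))                        # floating start, end of line
```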
Set End of Line - First Line of Text
The RATE field at the end of the TRN line starts with a Floating Tag, but ends at the end of the line of text. Define this Data Field accordingly.
- Highlight the rate data from columns 93 through 97.
- Right-click and select Define Data Field > New Data Field.
- Type Rate in the Field Name box.
- On the Start Rule tab, select the Floating Tag option. Type RATE: in the box.
- Click the End Rule tab and click the End of Line option.
- Click the Data Collection/Output tab and set the Default FldLength in bytes for the field.
- Click Add.
- Proceed to Set Floating Tags - Second Line of Text.
Set Floating Tags - Second Line of Text
Look at the ATTDOC line of text in the records. Notice that the Data Fields in this line are also Floating Tag data. Follow these steps to define all the Data Fields except the last field.
- Highlight the attending doctor number from columns 29 through 34.
- Right-click and select Define Data Field > New Data Field.
- Type Attdoc_No in the Field Name box.
- On the Start Rule tab, select the Floating Tag option. Type ATTDOC NO: in the box.
- Click the End Rule tab.
- Click the Floating Tag option. Type ATTDOC: in the box.
- Click the Data Collection/Output tab and set the Default FldLength in bytes for the field.
- Click Add.
- Repeat the steps (using the appropriate field names and tags) for the remainder of the Data Fields on the ATTDOC line, except the last field (BY).
- Proceed to Set End of Line - Second Line of Text.
Set End of Line - Second Line of Text
The BY field at the end of the ATTDOC line starts with a Floating Tag, but ends at the end of the line of text just like the RATE field in the first line. So, use the same steps as before, except use End of Line as the End Rule for that field.
- Highlight the field.
- Right-click and select Define Data Field > New Data Field.
- Name the field.
- Set the Start Rule.
- Click the End Rule tab and click the End of Line option.
- Click the Data Collection/Output tab and set the Default FldLength in bytes for the field.
- Click Add.
- Browse the data records to see how your data has changed.
- If desired, rearrange the data fields as needed to meet the requirements of the export data file.
- Save and close your script.
Data Extractor Tutorial 5 - Columnar Data with a Footer
Tutorial 5 guides you through the steps to create and save a script file in Extract Schema Designer that reads and flattens a data file containing both detail lines and a footer line with data to extract.
This tutorial is useful to anyone likely to be working with columnar data that includes a footer. It is recommended that you complete Data Extractor Tutorial 3 - Columnar Data before doing this tutorial.
By following the steps outlined below, you become familiar with both the process of creating an extract script and the terms used throughout the documentation.
Tutorial Goals
In this tutorial, you will learn:
- How to create a script that reads and flattens a data file with detail and footer lines
- How to save the script file
- New terms located throughout the documentation
Procedure
The steps in this tutorial should be completed in the order shown.
The primary data record information is in the table detail lines. This data is highly structured in neat consistent columns. Extract Schema Designer can build recognition patterns for Line Styles and Data Fields with this type of data automatically, saving you a lot of time and effort.
Define Line Styles and Data Fields
After selecting the tutorial file and setting up basic options, define line styles and data fields. Extract Schema Designer automatically creates a Line Style for the line and gives it a default Line Name of SALESMARKETING when you complete this task. Extract Schema Designer also automatically parses the line into 7 Data Fields using spaces as a column separator. The Data Fields are given default names of SALES/MARKETING_1 through SALES/MARKETING_7.
To define line styles and data fields:
- Select the first detail line (it begins with SALES/MARKETING) by clicking in the Line Style column immediately to the left of the line to highlight the entire line of text.
- Right-click in the Line Style Column (the yellow stripe on the left part of the screen), and select Parse Columnar Data.
- Proceed to Change Data Field Names.
Change Data Field Names
Since the Data Field names are used in the Browse Data Record as column headings for the Data Fields, change the Data Field names for SALES/MARKETING_1 through SALES/MARKETING_7 to more descriptive field names.
See the new, and more descriptive, names for the Data Fields in Table 3-4 below.
Table 3-4: Tutorial 5 - Suggested Data Field Names
| Default Name | Suggested Name |
| SALES/MARKETING_1 | Department |
| SALES/MARKETING_2 | Team1 |
| SALES/MARKETING_3 | Team2 |
| SALES/MARKETING_4 | Team3 |
| SALES/MARKETING_5 | Team4 |
| SALES/MARKETING_6 | Team5 |
| SALES/MARKETING_7 | DepartmentTotal |
To change Data Field names:
- In the Preferences menu, disable Close Definition Dialogs on Add/Update by unchecking it.
- Double-click on one of the Data Fields in the SALESMARKETING line to open the Field Definition window.
- In the Field Definition window, highlight the default field name and replace it with the corresponding descriptive name given above.
- Click Update.
- To select the next Data Field, click the Field Name arrow and a list of Data Fields that have been defined for the current Line Style is displayed. Select the next Data Field.
- Continue until you have renamed all the fields.
- Click Close.
- Double-click on the name, SALESMARKETING, in the Line Style column on the left of the screen.
The Line Style Definition window appears.
- Type in a new name, Detail.
- Click Update.
- Proceed to Define Recognition Patterns.
Define Recognition Patterns
The Detail Line Style (originally named SALESMARKETING) is recognized by a pattern where columns 2 to 16 contain the text SALES/MARKETING. This pattern matches only the first detail line. It needs to be general enough to match all of the detail lines in the text, but specific enough to match only the detail lines, not the TEAM TOTALS line.
Analyze the detail lines to find what makes them unique in comparison to other lines in the text. Things to look for are position of the Data Fields, contents of the Data Fields, anything that is consistent for each of the detail lines but not contained in non-detail lines. For example, the detail lines contain:
- Commas in positions 24, 34 and 75 on every line.
- Only letters, white space, and a / in columns 2 through 18.
- Only digits, white space, and commas in columns 20 through 79.
- A digit in position 78.
- An upper case letter in each of the first five character positions (columns 2 through 6).
Of all of the above observations, creating a pattern to look for upper case letters in those five positions is the best way to go. Here are some reasons why:
- Defining a pattern that checks for commas in positions 24, 34, and 75 would require 3 pattern lines and probably would not match every detail line in subsequent reports. Suppose in this same report (created a week later) Team 2 of the Development department went to a pre-paid weeklong class and they only spent 100 dollars on supplies for the class. This means that a comma would not be in position 34 of that detail line so it would not match the Line Style.
- Defining a pattern to check for letters, white space, and a / in columns 2 through 18 would require three pattern lines and also matches the column heading line.
- Defining a pattern to match lines that contain at least one digit in positions 20 through 79 and do not contain letters or / would require 3 pattern lines. It also matches the Team Totals line.
- Defining a pattern to match lines that contain a digit in position 78 would match the detail lines and the Team Totals line.
So, the best pattern to use is one that looks for capital letters in columns 2 through 6.
To define a recognition pattern:
- In the Line Style Definition window, click once in the Look For? cell in the first row of the grid, then click the arrow to display the Pattern Builder window.
- Change the value of the Type field from literal to character class by clicking in the Type cell, then clicking the arrow to display the allowable values for the Type field. Select character class from the list.
- Click in the Value cell, then click the arrow to display the allowable values for the Value field. Select upper case letters from the list.
- Highlight the value in the Count field and change it to 5. The value of the Begin field should be 2. Change the value of the End field to 6 and click OK.
- Click in the empty cell in the second row of the And/Or column. The string And automatically displays in that cell.
- Click in the first empty cell in the Search What? Column. Then click on the down arrow and select Column Range (m-n) from the list.
- Click in the first empty cell in the Operator column. Then click on the down arrow and select Does Not Contain from the list.
- Click in the first empty cell in the Look For? column. Then click on the down arrow. This opens the Pattern Builder window.
- If the Value column is not empty, delete the contents of that cell. Click once in the cell in the Value column to position the blinking cursor there, then type a capital P.
- Change the values in the Count, Begin and End cells to 1.
- Click OK in the Pattern Builder window.
- Click Update and Close in the Line Style Definition window.
Notice that Detail now appears beside each of the detail lines in the Line Style column, and not next to the Processing Date line.
- Proceed to Modify Data Fields.
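The two-row recognition pattern defined in the steps above can be pictured in Python. This sketch is an analogy only; `is_detail` and the sample lines are hypothetical. It assumes the PROCESSING DATE line starts in column 1, so its columns 2 through 6 would otherwise satisfy the upper case rule.

```python
def is_detail(line):
    """Apply both pattern rows: upper case letters in columns 2-6 And no P in column 1."""
    segment = line[1:6]                                   # columns 2 through 6
    rule1 = len(segment) == 5 and all("A" <= ch <= "Z" for ch in segment)
    rule2 = line[:1] != "P"                               # column 1 does not contain P
    return rule1 and rule2

# Hypothetical report lines.
print(is_detail(" SALES/MARKETING     75,249"))   # detail line
print(is_detail(" TEAM TOTALS        200,000"))   # fails rule 1
print(is_detail("PROCESSING DATE:10/26/95"))      # passes rule 1, fails rule 2
```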
Modify Data Fields
To modify the Data Fields in the Detail lines so the Data Extractor can extract the data on the last line of the report:
- Double-click in the Department Data Field, the red text at the beginning of each detail line.
The Field Definition window opens.
- Click on the Data Collection/Output tab, and click on the Array Field option to enable it.
- Click Update.
- Click the Data Field Name arrow and choose the next Data Field.
- Repeat this process for each field in any one line of text.
- Click Close.
- Proceed to Define Line Style.
Define Line Style
To define the PROCESSING DATE line:
- Highlight PROCESSING DATE:.
- Right-click in the Data Panel and select Define Line Style > Auto New Line Style > Accept Record.
Extract Schema Designer defines the Line Style using the first word of the highlighted text in that position as the recognition pattern and names the Line Style PROCESSING.
- Proceed to Define Data Field.
Define Data Field
To define the Data Field on the PROCESSING DATE line:
- Highlight the date from columns 17 through 24.
- Right-click in the Data Panel and select Define Data Field > New Data Field.
- When the Field Definition window opens, change the default Field Name to Date.
- Click Add in the Field Definition window.
- Browse the data Records to see how your data has changed.
- Rearrange the data fields as needed to meet the requirements of your export data file.
- Save and close your script.
Data Extractor Tutorial 6 - Variable Length Multi-Line Data Fields
Tutorial 6 guides you through the steps to create a script that reads and flattens a data file containing data that extends across multiple lines of text and where the end of each record varies.
This tutorial is useful to anyone who has a report with fields that extend across more than one line, or has no consistent end of record line.
By following the steps outlined below, you become familiar with both the process of creating an extract script and the terms used throughout the documentation.
Tutorial Goals
In this tutorial, you will learn:
- How to create a script that reads and flattens a data file with varied record lengths
- How to save the script file
- New terms located throughout the documentation
Procedure
The steps in this tutorial should be completed in the order shown.
Scroll through the data to get an idea of this file’s structure. Notice that there are eight or nine left-aligned Field Tags in each record. These tags can be used to easily identify and define the Line Styles in this report.
Define Line Styles
After selecting the tutorial file and setting up basic options, define your line styles.
- In the third line, highlight DATE: from column 1 through column 5.
- Right-click with the mouse positioned anywhere in the Data Panel (the white part of the screen) and select Define Line Style > Auto New Line Style > Action-Collect fields.
Data Extractor automatically defines the Line Style with a Recognition Pattern of DATE: in columns 1 through 5 and names the Line Style DATE.
- Repeat the same basic procedure in step 2 for each of the following Field Tags in the first record. See Table 3-5 below.
Table 3-5: Tutorial 6 - Field Tag Columns
| Field Tag | Beginning Column | Ending Column |
| RECORDATION | 1 | 12 |
| CONSIDERATION | 1 | 14 |
| SITE DIMENSIONS | 1 | 16 |
| SITE AREA | 1 | 10 |
| ZONING | 1 | 7 |
| REMARKS | 1 | 8 |
Note: You may also, if you wish, go to the second record and define the LEGAL DESCRIPTION: field in columns 1-18.
- In the 23rd line of the report use the mouse to highlight UNIT PRICE: from column 1 through column 11.
- Right-click in the Data Panel and select Define Line Style > Auto New Line Style > Action-Accept Record.
Data Extractor automatically defines the Line Style with a Recognition Pattern of UNIT PRICE: in columns 1 through 11 and names the Line Style UNIT_PRICE.
To verify that the Line Style Definitions match the appropriate lines of text throughout the report, scroll down and see that each of the lines that contain a Field Tag has the corresponding Line Style Name in the Line Style Column to the left of the text line.
- Proceed to Define Data Fields.
Define Data Fields
- Highlight November 12, 1993 from column 26 through column 42.
- Right-click and select Define Data Field > New Data Field. The Field Definition window appears.
The Field Name defaults to DATE_1.
- Type DATE to overwrite the default or click in the Field Name box and backspace over the _1.
- Click Add.
- Highlight Jeff County from columns 26 through 36.
- Right-click and select Define Data Field > New Data Field.
The Field Name defaults to RECORDATION_1.
- Type in something else if you wish or click with the mouse and backspace twice to remove _1.
- Click Add.
- Proceed to Set Continuation Rule.
Set Continuation Rule
Notice that some of the data you want to extract resides within a single line of text in one record but continues across multiple lines of text in other records. For example, the data in the CONSIDERATION field in the first record is on a single line of text, but in the second record, the data in the CONSIDERATION field continues across nine lines of text. This is easily defined using the Data Extractor feature called Continuation Rule.
- Highlight $333,000 Cash from columns 26 through 38.
- Right-click and select Define Data Field > New Data Field.
- Change the field name, if you wish.
- Click the End Rule tab and select the End of Line option.
- Click the Continuation Rule tab and select the Until Next Line Style option.
There is one extra step necessary for fields that are not fixed in length, which is setting the Default FldLength to prevent data truncation.
- Set the Default FldLength to 500 bytes on either the End Rule tab or the Data Collection/Output tab.
- Click Add.
- Ensure that Data Extractor is picking up all the data by clicking Browse Data Record, and widening the CONSIDERATION column.
- Define all of the remaining Data Fields.
- Browse the data records to see how your data has changed.
- Rearrange the data fields as needed to meet the requirements of the export data file.
- Save and close your script.
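The Until Next Line Style continuation rule used in this tutorial can be approximated with a short Python sketch. It is an analogy, not the product's implementation: `tag_of`, `collect_field`, and all sample text other than the date are hypothetical, and joining continued lines with a single space is an assumption about how the pieces are combined.

```python
TAGS = ("DATE:", "RECORDATION:", "CONSIDERATION:", "UNIT PRICE:")

def tag_of(line):
    """Return the left-aligned Field Tag that starts this line, if any."""
    for tag in TAGS:
        if line.startswith(tag):
            return tag
    return None

def collect_field(lines, tag, start_col=26):
    """Collect a field from start_col to end of line, continuing until the next tagged line."""
    parts, collecting = [], False
    for line in lines:
        found = tag_of(line)
        if found == tag:
            collecting = True
            parts.append(line[start_col - 1:].strip())
        elif found is not None:
            if collecting:
                break                      # the next Line Style ends the field
        elif collecting:
            text = line[start_col - 1:].strip()
            if text:
                parts.append(text)         # continuation line
    return " ".join(parts)

# Hypothetical record with a CONSIDERATION field spanning two lines.
record = [
    f"{'DATE:':<25}November 12, 1993",
    f"{'CONSIDERATION:':<25}$500,000 cash at closing",
    " " * 25 + "plus assumption of an existing note",
    f"{'UNIT PRICE:':<25}$4.25 per square foot",
]
print(collect_field(record, "CONSIDERATION:"))
```

A field collected this way can be much longer than any single line, which is why the steps above raise the Default FldLength.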
Data Extractor Tutorial 7 - Multiple Accept Records
Tutorial 7 guides you through the steps to create and save an extract script file in Data Extractor that uses multiple Accept Records.
You parse the data in a report file, TUTOR7 (supplied during installation), into a format suitable for exporting. By following the steps outlined below, you become familiar with both the process of creating an extract script and the terms used throughout the documentation.
Tutorial Goals
In this tutorial, you will learn:
- How to create a script that parses a report file into a multiple record file
- How to use multiple Accept Records in your script
- How to save the script file
- Terms used throughout the documentation
Procedure
The steps in this tutorial should be completed in the order shown.
Define a Line Style and Data Fields
After selecting the tutorial file and setting up basic options, begin creating line styles for the first Accept Record.
To create a Line Style and Data Fields for these lines:
- Select the first detail line (Parmer Lane Animal Hospital) by clicking in the Line Style column immediately to the left of the line. This highlights the entire line of text.
- Right-click anywhere in the Line Style column and select New Line Style.
The Line Style Name defaults to Parmer_Lane_Animal_H.
- Rename it HospitalLine.
- Click Add.
- Highlight Parmer Lane Animal Hospital in the Data Panel.
- Right-click in the Data Panel and select Define Data Field > New Data Field.
The Field Name defaults to HospitalLine_1.
- Change the name to Hospital.
- Click Add.
- Right-click in the Line Style column to the left of April 1, 1999, and select New Line Style.
- Change the Line Style Name to ReportDateLine and click Add.
The data on this line is always centered beneath the HospitalLine. Depending on what month and day the report is run on, the data field may be longer or shorter than the current date.
- To make sure that your data field is wide enough, highlight the data from positions 31 through 57.
- Right-click in the Data Panel and select Define Data Field > New Data Field.
- Change the field name to ReportDate.
- Click the Data Collection/Output tab.
- Make sure that the Trim Leading and Trailing Spaces box under Other Collection Options is checked.
- Click Add.
The first repeating Line Style that we want Extract Schema Designer to find is the Account Number.
- Click in the Line Style column to the left of 1101-01, then right-click anywhere in the Line Style column and select New Line Style.
- The Line Style Name defaults to STYLE1. Change it to AccountLine.
- Proceed to Define Line Recognition Rules.
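The ReportDate field defined above is a fixed-column field with trimming. The following Python sketch illustrates the idea only; it is not how the Data Extractor is implemented. The function name extract_field and the sample line are hypothetical, and the sketch assumes that column positions are 1-based and inclusive.

```python
def extract_field(line: str, begin: int, end: int, trim: bool = True) -> str:
    """Return the text in columns begin..end (1-based, inclusive)."""
    # Pad short lines so the slice always covers the requested columns.
    padded = line.ljust(end)
    value = padded[begin - 1:end]
    return value.strip() if trim else value


# Hypothetical report line: the date is centered, so its start column varies.
sample = " " * 35 + "April 1, 1999"
report_date = extract_field(sample, 31, 57)
print(report_date)
```

Because the field spans positions 31 through 57 and is trimmed, a longer date is still collected whole as long as it stays within those columns.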
Define Line Recognition Rules
Notice the entry under the Look For? column. Its default is in position 5. While this catches all pertinent lines in our example, it might not catch all instances in a larger record example.
To update the Line Recognition Rules:
- Click in the first cell under Look For?, then click the arrow.
- In the Pattern Builder window, click in the first cell under Type and select Mask from the list.
- Click in the first cell under Value. Keep the hyphen and type a pound sign (#) for each numeral, for example ####-##.
This tells Data Extractor that there are four digits followed by a hyphen, and then two more digits.
- Change the Begin position from 5 to 1.
- Change the End position from 5 to 7.
- Click OK.
- Click Add.
- Highlight the account number (1101-01) and right-click in the Data Panel.
- Select Define Data Field > New Data Field.
A Field Definition window appears.
- Change the Field Name from AccountLine_1 to AccountNo.
- Click Add.
- Before you continue, select Source Options from the tool bar and select the Flush Field Contents on Accept Default box under the Script Design Choices tab.
This flushes the data from the remaining fields in your report, unless you manually change a specific field to propagate the data.
- Click OK.
- Proceed to Define Line Styles and Data Fields.
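To see what the ####-## mask in positions 1 through 7 accomplishes, consider the following Python sketch. It is an illustration under stated assumptions, not the Data Extractor's own matching logic: the helper names mask_to_regex and line_matches are hypothetical, each # is assumed to match exactly one digit, every other mask character is assumed to be literal, and the mask is assumed to fill the whole Begin-to-End span.

```python
import re


def mask_to_regex(mask: str) -> re.Pattern:
    """Translate a mask where '#' stands for one digit; other characters are literal."""
    parts = [r"\d" if ch == "#" else re.escape(ch) for ch in mask]
    return re.compile("".join(parts))


def line_matches(line: str, mask: str, begin: int, end: int) -> bool:
    """Check the mask against columns begin..end (1-based, inclusive)."""
    return mask_to_regex(mask).fullmatch(line[begin - 1:end]) is not None


print(line_matches("1101-01", "####-##", 1, 7))    # an account number line
print(line_matches("Robertson", "####-##", 1, 7))  # a name line
```

A mask is broader than the literal text of one account number, yet still specific enough to skip name and address lines.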
Define Line Styles and Data Fields
- Select the first detail line under the first account number.
- Right-click in the Line Style column to the left of Robertson and select New Line Style.
A Line Style Definition window appears.
- Rename the Line Style from Robertson to LastNameLine.
- Click the Recognized By arrow and select Relative Position.
Note: The information under Line Recognition Rules changes. The default Base Line is AccountLine.
- If it is not, click in the first cell under Base Line and select AccountLine from the drop-down list.
The default Line Count from AccountLine is 1. However, there is a blank line between the AccountLine and the LastNameLine.
- Change the count from 1 to 2.
- Click Add.
- Highlight Robertson and continue out to position 35, in case someone further in the file has a very long last name.
- Right-click in the Data Panel and select Define Data Field > New Data Field.
A Field Definition window appears.
- Change LastNameLine_1 to LastName and click Add.
- Right-click in the Line Style column to the left of Linda and select New Line Style.
A Line Style Definition window appears.
- Change the Line Style name from Linda to FirstNameLine.
- Click the Recognized By arrow and select Relative Position.
The Line Recognition Rules should default to three lines from AccountLine.
- If they do not, change the Count and the Base Line accordingly.
- Click Add.
- Now highlight Linda and continue out to position 35, in case someone further in the file has a very long first name.
- Right-click in the Data Panel and select Define Data Field > New Data Field.
A Field Definition window appears.
- Change FirstNameLine_1 to FirstName.
- Click Add.
- Right-click in the Line Style column to the left of 143 Patterson Place and select New Line Style.
A Data Extractor Line Style Definition window appears.
- Change the Line Style name to Address1Line.
- Click the Recognized By arrow and select Relative Position.
The Line Recognition Rules should default to 4 lines from AccountLine.
- If not, change the Count and the Base Line accordingly and click Add.
- Highlight 143 Patterson Place and continue out to position 60, in case someone further in the file has a very long address.
- Right-click in the Data Panel and select Define Data Field > New Data Field.
A Field Definition window appears.
- Change Address1Line_1 to Address1 and click Add.
- Right-click in the Line Style column to the left of Austin TX 78759 and select New Line Style.
- In the Line Style Definition window, change the Line Style name to CSZ (for City, State, Zip).
- Click the Recognized By arrow and select Relative Position.
The Line Recognition Rules should default to 5 lines from AccountLine.
- If not, change the Count and the Base Line accordingly and click Add.
- Proceed to Define the Line Style - Accept Record.
Define the Line Style - Accept Record
Since this is the end of the first record (before other information becomes a subset of this information), make this line an Accept Record.
- Click the Line Action tab, select ACCEPT Record Including, and click Add.
- Right-click the Line Style column again and select Parse Columnar Data.
- Double-click the red Austin field.
- Rename the Field Name from CSZ_1 to City.
- Click Update.
- Double-click on the green TX field.
- In the Data Extractor Field Definition window, rename the Field Name from CSZ_2 to State.
- Since this field is always a two-character field, click the End Rule tab and change the Fixed Column number to ZZ.
- Click Update.
The State field now has only the two letters underlined in green.
- Double-click on the blue 78759 field.
- In the Field Definition window, rename the Field Name from CSZ_3 to Zip.
- Since this field always starts at position 40, change the Start Rule Fixed Column accordingly.
- Since this field may extend out to position 49, change the End Rule Fixed Column accordingly.
If you do not extend this field, the last four digits of the zip code in the second record are not picked up.
- Click Update.
- View the data collected and rearrange data fields as needed.
- Proceed to Define Line Style for Pet Information.
Define Line Style for Pet Information
You must define a line style that recognizes each of the pet information lines. Analyze each of these records to find a common pattern that can define the Line Style.
The Pet Type, Pet Name, Sex, and Color are all different in each record. The only thing that remains the same is the placement of Age, in positions 42 and 43.
To define your Line Style to recognize this pattern:
- Click in the Line Style column to the left of DSH-C.
- Right-click anywhere in the Line Style column and select New Line Style from the pop-up list.
A Line Style Definition window appears.
- Change the Line Style Name from DSHC to PetInfoLine.
- Click in the first cell under Look For?.
- Click on the down arrow that appears to the right of that cell.
The Pattern Builder window appears.
- Since you do not want to look for a specific character, change the Type from literal to character class.
- Click in the first cell under Value and select digits from the drop-down list.
- Since you cannot be sure whether there are one or two digits for the Age, change the Count from 5 to 1.
- Change the Begin position from 3 to 42, the first possible Age position.
- Change the End position from 7 to 43, the last possible Age position and click OK.
Since the PetInfoLine may have visit information under it, it should be an Accept Record as well.
- Click on the Line Action tab and select ACCEPT Record Including.
- Click Add. Then scroll through the file to see if Data Extractor now recognizes all of the PetInfoLines.
You should see a green arrow to the left of each of these lines.
- With the PetInfoLine still highlighted, right-click in the Line Style column again and select Parse Columnar Data.
- Since the data fields default to PetInfoLine_1 through PetInfoLine_5, go to Preferences and de-select Close Definition Dialogs on Add/Update.
- Rename each of the fields more appropriately by clicking the Field Name arrow, and selecting each new data field.
If you make a change to a data field, click Update before moving on to the next data field.
Use the following new names: PetBreed, PetName, PetAge, PetsSex, and PetsColor.
Note that the PetAge field does not extend far enough to catch the second digit of the age.
- Open the data field definition window and change the Start Rule to 42, and the End Rule to 43.
- Click Update.
The field is now properly aligned.
- Similarly, the PetsSex field is always a single character. So open the data field definition window and change the End Rule to 58.
- Click Update.
The field is now limited to a single character field.
- Open the data field definition window and change the End Rule to 90.
- Click Close.
- From the Preferences menu, enable Close Definition Dialogs on Add/Update.
- Since there may be office visit records for each pet, propagate the pet name so that it can be used with the visit date.
- Proceed to Change the Accept Record Behavior.
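The PetInfoLine recognition rule built in this section uses a character class instead of literal text: a digit somewhere in positions 42 through 43. The Python sketch below shows one plausible reading of that rule. It is an assumption, not a description of the product's internals: the function name has_digit_run is hypothetical, Count is assumed to mean the minimum number of consecutive digits, and positions are assumed to be 1-based and inclusive.

```python
DIGITS = "0123456789"


def has_digit_run(line: str, begin: int, end: int, count: int = 1) -> bool:
    """True if columns begin..end (1-based, inclusive) hold `count` consecutive digits."""
    run = 0
    for ch in line[begin - 1:end]:
        run = run + 1 if ch in DIGITS else 0
        if run >= count:
            return True
    return False


# Hypothetical pet lines: the age is right-aligned in columns 42 and 43.
one_digit = "DSH-C".ljust(41) + " 5"
two_digit = "LAB".ljust(41) + "12"
name_line = "Robertson"

for sample in (one_digit, two_digit, name_line):
    print(has_digit_run(sample, 42, 43, count=1))
```

With Count set to 1, both single-digit and two-digit ages satisfy the rule, which is why the tutorial lowers the default Count.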
Change the Accept Record Behavior
To change the behavior on Accept:
- Double-click on Shiva to open the Field Definition window.
- Click the Data Collection/Output tab.
- Click Propagate Field Contents, then click Update.
- To view the data collected for this Accept Record, browse the data record.
- To select the Current Accept Record, click PetInfoLine in the middle of the screen.
- Click Assign to Current Accept Record.
- Click to the left of 02/18/99 on line 20 to define the VisitInfo line style.
- Right-click anywhere in the Line Style column and select New Line Style.
- Change the default NA Line Style Name to VisitInfoLine.
- Click on the Line Action tab and select ACCEPT Record Including.
- Click Add.
- With the VisitInfoLine still highlighted, right-click in the Line Style column again and select Parse Columnar Data.
- Since the data fields default to VisitInfoLine_1 through VisitInfoLine_3, go to Preferences and disable Close Definition Dialogs on Add/Update.
- Double-click the first data field in the VisitInfoLine.
- Rename each of the fields appropriately, by clicking the Field Name arrow, and selecting each new data field.
If you make a change to a data field, click Update before moving on to the next data field.
Use the following new names: VisitDate, Diagnosis, and Service.
- Since the VisitDate field is always a fixed length, change the End Rule to 15.
- Change the End Rule on the Service field to 105 to be sure that it collects all the information in that field.
- Click Update and then Close.
- From Preferences, re-check Close Definition Dialogs on Add/Update.
- Assign the account number to each pet name, and associate the pet name and account number with each office visit by selecting Record from the menu, and selecting Edit Accept Record.
- Select PetInfoLine as the Current Accept Record. Then click Show Fields in the upper right portion of the Accept Record Definition window.
- Click the AccountLine check box.
- Click Update.
- Browse the data to see that the account number now shows up at the bottom of the PetInfoLine fields.
- Click Record in the top toolbar again, and select Edit Accept Record.
- Select VisitInfoLine as the Current Accept Record.
- Click Show Fields in the upper right portion of the Accept Record Definition window.
- Since you want to add the account number to this Accept Record, click the AccountLine check box.
- Add the pet’s name to this Accept Record by clicking the PetName check box.
- Click Update.
- Click on the VisitInfoLine in the middle of the screen to select the Current Accept Record.
- Click Assign to Current Accept Record.
- Scroll through the data to see that the visit information shows up under the pet name.
- Browse the data records to see how your data has changed.
- Rearrange the data fields as needed to meet the requirements of your export data file.
- Save and close your script.
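The result of this tutorial is a script with more than one Accept Record, where the account number is added to each pet record and both the account number and pet name are added to each visit record. The following Python sketch models that outcome in a simplified way. It is not the Data Extractor's algorithm: the assemble function, the inherited mapping, the pet age, and the second visit date are hypothetical and serve only as illustration.

```python
# Each tuple is (line style, fields collected on that line); values are illustrative.
lines = [
    ("AccountLine", {"AccountNo": "1101-01"}),
    ("PetInfoLine", {"PetName": "Shiva", "PetAge": "5"}),
    ("VisitInfoLine", {"VisitDate": "02/18/99"}),
    ("VisitInfoLine", {"VisitDate": "03/02/99"}),
]

# Fields from other lines that are added to each Accept Record type.
inherited = {
    "PetInfoLine": ["AccountNo"],
    "VisitInfoLine": ["AccountNo", "PetName"],
}


def assemble(lines, inherited):
    """Emit one record per Accept line, including the inherited fields."""
    current = {}  # most recent value seen for every field
    records = {style: [] for style in inherited}
    for style, fields in lines:
        current.update(fields)
        if style in inherited:
            record = {name: current.get(name, "") for name in inherited[style]}
            record.update(fields)
            records[style].append(record)
    return records


records = assemble(lines, inherited)
for style, rows in records.items():
    print(style, rows)
```

Each visit record carries the account number and pet name, so the visit data can later be related back to its pet and account.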
Introduction to Basic Elements
When working with report or text files in the Data Extractor, there are three basic elements to help you accomplish your task. Those basic elements are line style, data field, and field content.
A line style is what you define that tells the Data Extractor how to identify a particular line of text in the report. You want to define each line style in such a way that the Data Extractor can identify that same line of text throughout the report. The trick to defining a good line style is to make the recognition rules specific enough to not include any lines you do not want recognized, yet broad enough to include all lines you do want recognized. See Defining Line Styles.
A data field is what you define that tells the Data Extractor which specific portions of the report you want to extract and assemble into data records. There are options in the Data Extractor that let you determine which data fields are collected and assembled as part of each output data record. For details, see Define Data Fields.
The field content is the data that occupies each defined data field. Any defined data field may contain data or may be blank in any given record. There are options in the Data Extractor that let you determine whether or not the field contents of one data field are carried forward to subsequent data fields that are blank. For details on the Flush Field Contents, Propagate Field Contents, and Flush Field Contents on Accept Default options, see Define Data Fields and Source Options Window.
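The difference between flushing and propagating field contents can be pictured with a short Python sketch. This is a simplified model under assumptions, not the product's implementation: the function assemble_records and its propagate parameter are hypothetical, and a blank field is represented by an empty string.

```python
def assemble_records(rows, propagate=()):
    """Build output records; a blank field is filled from the previous
    record only when that field is marked as propagated."""
    previous = {}
    out = []
    for row in rows:
        record = {}
        for name, value in row.items():
            if value == "" and name in propagate:
                value = previous.get(name, "")
            record[name] = value
        previous = record
        out.append(record)
    return out


rows = [
    {"AccountNo": "1101-01", "LastName": "Robertson"},
    {"AccountNo": "", "LastName": ""},
]
flushed = assemble_records(rows)
propagated = assemble_records(rows, propagate={"AccountNo"})
print(flushed[1])
print(propagated[1])
```

In the flushed case the second record keeps its blank AccountNo; in the propagated case it inherits the value from the record before it, while LastName stays blank because it was not marked for propagation.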
When using Data Extractor to extract data from a text or report file, you follow some general procedures. It is important to understand that the goal is to define line styles and data fields in such a manner that the Data Extractor is able to assemble records of data out of the information contained in the report.
Some Helpful Tips
These sections offer some helpful hints and tips that may make your task easier. Since every report or text file is different, these subjects are more general in nature. There are more specific examples offered in other sections of the documentation.
Finding Logical Record Breaks
Your goal is to extract useful data out of your report or text file and then assemble that data into a field-and-record-oriented format. Therefore, one of your initial steps should be to examine the report and find the logical record breaks. When you locate a logical record break, you define a line of text as the ACCEPT Record to mark where the Data Extractor should stop collecting data fields and assemble a data record.
Some types of reports are formatted in such a way that logical record breaks are easy to locate and the ACCEPT Record is easy to define. Some examples follow:
- Reports where each page contains the information that comprises one record, and the last line of text is defined as the ACCEPT Record. See Extract Schema Tutorial 1 - The Basics and Extract Schema Tutorial 2 - Tagged Data and Automatic Features.
- Columnar reports where each line of text comprises one record, and each line is defined as the ACCEPT Record. See Extract Schema Tutorial 3 - Columnar Data.
- Variable-length ASCII files where each record is derived from a consistent number of lines of text, and the last line of each record is defined as the ACCEPT Record. See Extract Schema Tutorial 4 - Floating Tags.
Other types of reports are not so easy. Some examples follow:
- Reports that contain detail lines and a footer with data to be extracted. See Extract Schema Tutorial 5 - Columnar Data with a Footer.
- Reports that contain data that extends across multiple lines of text within one data field. See Extract Schema Tutorial 6 - Variable Length Multi Line Data Fields.
- Reports that contain data that fits into more than one logical record type. See Extract Schema Tutorial 7 - Multiple Accept Records.
When your report is formatted in such a way that defining the end of a record is difficult, sometimes the only way to handle this is to define the beginning of a record. You can use the ACCEPT Record option that tells the Data Extractor to assemble a record, but collect the data fields Before Collecting this line’s fields. This ends the field collection action on the last defined line of text that falls BEFORE the ACCEPT Record line, and places any data fields on the ACCEPT Record line in the next record.
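The two ways of marking a record break, on the last line of a record or on the first line of the next one, can be compared in the following Python sketch. It is a simplified model, not the Data Extractor's code: the function build_records, its mode values, and the sample line styles are hypothetical, and the sketch assumes that a record still open at the end of the file is written out.

```python
def build_records(lines, accept_style, mode="including"):
    """Group (style, fields) lines into records at each ACCEPT line.

    mode="including": the ACCEPT line's fields close the current record.
    mode="before": the record is emitted first, and the ACCEPT line's
    fields start the next record.
    """
    records, current = [], {}
    for style, fields in lines:
        if style == accept_style and mode == "before":
            if current:
                records.append(current)
            current = dict(fields)
        else:
            current.update(fields)
            if style == accept_style:
                records.append(current)
                current = {}
    if mode == "before" and current:
        records.append(current)  # assumed: flush the final record at end of file
    return records


lines = [
    ("Header", {"Name": "A"}),
    ("Detail", {"Amount": "10"}),
    ("Header", {"Name": "B"}),
    ("Detail", {"Amount": "20"}),
]
by_start = build_records(lines, "Header", mode="before")
by_end = build_records(lines, "Detail", mode="including")
print(by_start == by_end)
```

When the end of a record is hard to recognize but its first line is not, accepting before the first line yields the same records as accepting on the last line would.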
Basic Steps
These are the basic steps required for creating an Extract script.
- Open a text file, report file, or URI with the Data Extractor.
- Define line styles for each line in the source file that contains information to be extracted.
- Within each defined line, you may define one or more data fields.
- After defining all the needed line styles and data fields, save the script.
- Export the extracted data.
The following sections explain the details of these procedures.
- How to Create a Report File
- Defining Line Styles
- Defining Data Fields
- Saving an Extract Script
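The last of the basic steps, exporting, turns the assembled records into a CSV text file. As a rough picture of that output, the Python sketch below writes records in a chosen field order using the standard csv module. The function export_csv is hypothetical, and whether the exported file begins with a header row of field names is an assumption made here for readability.

```python
import csv
import io


def export_csv(records, field_order):
    """Write records as CSV text, with columns in the given field order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=field_order,
        extrasaction="ignore",  # skip fields that are not part of the export
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [{"AccountNo": "1101-01", "LastName": "Robertson", "FirstName": "Linda"}]
csv_text = export_csv(records, ["LastName", "FirstName", "AccountNo"])
print(csv_text)
```

Changing field_order rearranges the columns without touching the records themselves.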
Open a Text File or URI
To Create a New Extract
Follow these steps to open a text file, report file, or URI in the Data Extractor.
- Click the New Extract icon in the toolbar, or select New Extract from the File menu.
- In the Select the Text File window, choose your source file in one of three ways:
- Type the drive, directory path, and filename directly in the Source File text box and click OK.
- Click the arrow and browse to build the path and file name for the source file.
- Type the complete URI addressing scheme (e.g., http://www.yahoo.com/) in the Source File text box.
When the file or URI is selected, the text displays in the Data Panel section of the Data Extractor main window.
All file types that the Data Extractor can open appear in the Files of Type box in the Select Report/Text File window. Because the Data Extractor supports so many different types, not all extensions are visible within the Files of Type box. However, all available file types appear in the selection window, including those files with extensions not viewable in the Files of Type box.
Note: A minimum of v1.4.0 of Sun Java Runtime must be installed for the URI support to work. Without this component, you will receive the error message: "Unable to load Java virtual machine."
Note: Extract scripts (.CXL files) derived from a URI source connection are now stored in the default workspace directory called Extracts. For more information, see URI Support.
- Scroll through your file and locate a page or record that best represents the perfect record; i.e., one that is most representative of all the records. In the Source Options window, set the Sample Size to include this portion of the file. (See Source Options Window.)
Also, study the overall formatting of the information you want to extract, looking for any tagged or columnar data in the report. Pay special attention to the tag separators, field separators, and column separators. Again, in the Source Options Window, make appropriate selections from the list of available options.
You are now ready to begin defining line styles and data fields.
If you need to quit the Data Extractor and come back to your work at another time, save the extract. This also saves your work in a database file called extractor800.mdb in your \Common800 directory.
To open an existing extract with the same report:
If you have already defined some line styles and data fields in a report and saved that work, you can open the extract and report in another session and continue your work or make modifications to it.
If you have created and saved a complete extract, you can open the saved Extract and run an export.
To open an existing extract and report, double-click it with the mouse in the Extract Manager, or select it then select File > Open Extract.
Note: Extract scripts (.CXL files) derived from a URI source connection are stored in the default workspace directory called Extracts.
To open an existing extract with a different report:
You can open a previously-designed script with a different report or text file to check on script compatibility.
- Open the extract normally, either by double-clicking it, clicking Open Extract, or selecting File > Open Extract.
- Select Source > Options.
- Click the File Properties tab.
- Click the Text File arrow and browse to choose the new report.
Extract Tuning Tips
Because the Data Extractor reads and evaluates each data line, anytime you have many Line Styles (usually more than five), consider these tips to speed up the performance of your extract scripts.
- Change the order in which your Line Styles are checked. Position the most frequently found Line Styles at the top of your Line Style list. This way, "hits" that occur early on mark these lines and save the Data Extractor a great deal of comparison time. To change the Line Style order, select Line > ReOrder Line Styles.
Note: Do not alter the Line Style order when it is used to control which of multiple Line Styles could satisfy a line "hit". If you do, you might find your lines incorrectly marked.
- Keeping in mind that every line is checked against the existing Line Styles, consider defining a REJECT Line Style that will "hit" lines you do NOT want (i.e., blank lines). If these lines appear frequently in your file, consider moving the REJECT line up near the top of your Line Style order. Doing this saves a considerable amount of comparison time.
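The effect of Line Style order on comparison time can be demonstrated with a small Python sketch. It assumes, for illustration only, that each line is checked against the Line Styles from top to bottom and that checking stops at the first hit; the classify function, the style names, and the sample report are hypothetical.

```python
def classify(lines, styles):
    """Assign the first matching style to each line and count rule checks."""
    checks = 0
    names = []
    for line in lines:
        assigned = None
        for name, rule in styles:
            checks += 1
            if rule(line):
                assigned = name
                break
        names.append(assigned)
    return names, checks


styles = [
    ("HeaderLine", lambda s: s.startswith("REPORT")),
    ("TotalLine", lambda s: s.startswith("TOTAL")),
    ("DetailLine", lambda s: s[:1].isdigit()),
    ("BlankReject", lambda s: s.strip() == ""),
]
# One header, fifty detail lines each followed by a blank line, one total.
lines = ["REPORT A"] + ["1 item", ""] * 50 + ["TOTAL 50"]

names_slow, slow = classify(lines, styles)
reordered = [styles[2], styles[3], styles[0], styles[1]]
names_fast, fast = classify(lines, reordered)
print(slow, fast)
```

Both orders assign the same Line Style to every line, but placing the frequent detail and blank-line styles first needs fewer than half as many rule checks in this sample.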
Defining Line Styles
Each line in your text or report file that contains information to be extracted must be defined with a line style. In addition, other lines may need to be defined as reference lines. Each line style consists of a recognition rule and a line action.
A line style definition is what the Data Extractor uses to identify particular lines of information within your report. For each set of information that you want to assemble into a record of data there may be one or more line styles. The number of line styles needed is determined by how many lines of text in the report contain that set of information. For example:
- If the report contains header information that includes a date that you want to include as a field within each data record, define a line style for the header line that contains the date.
- If a detail line in the report contains information that you want to include as fields within each data record, define a line style for the detail line.
- If the date from the header line and the information from the detail line make one complete data record, define only those two line styles in the report.
In most cases, it is best if you do not define any lines of text from which you are not extracting data. This keeps the script file smaller and more efficient. Lines for which there is no line style defined are ignored. The exception to this rule is that in some cases performance can be improved by defining certain large repetitive sections as Skip lines.
Note: If your source file contains tabs and you want to use the Auto Line Style feature, you must first update your tab expansion setting. To do this, select Source > Options from the menu. Then, select the Printer Emulation tab and change the Tab Expansion setting to 0.
Recognition Rules
The recognition rule portion of a line style definition contains criteria by which the Data Extractor identifies a line of text. In other words, you tell the Data Extractor how to recognize a line or lines of the report by defining a set of criteria. After you define a line style in one section of your report, the Data Extractor compares each line of text in your entire report file with that recognition rule. For each line of text that matches the recognition rule, the Data Extractor assigns that line style. The line style name displays in the Line Style column to the left of the Data Panel for each matching line of text in the report.
The trick to defining a good recognition rule is to make it specific enough to NOT include any lines you do NOT want recognized, and broad enough to include ALL the lines you DO want recognized.
You may define line styles manually or let the Data Extractor automatically define them, depending upon whether or not your data can be handled by the Data Extractor's automatic features. For details about this option, see File Menu and Pop-up Menus.
Note: You may want to use the more advanced features after becoming familiar with the basic procedures. Tutorial 1 helps you get acquainted with the basics of the Data Extractor, and Tutorials 2 and 3 introduce you to some of the time-saving advantages of the advanced features.
If you are defining a line style manually, the Data Extractor suggests a recognition rule based on a pattern that displays in the Line Style Definition window when it opens. If you have highlighted a particular portion of the line, this portion is automatically suggested to create the recognition rule. You may modify the suggested recognition rule before adding it to the Data Extractor database and script. Details about different ways to define line styles are found in this documentation.
The manner in which you define line styles depends on a number of factors, including your own personal approach to a task. The other major factor is the type of text or report file with which you are working. The sections below should help you determine the best approach.
Remember the selections in the Source Options window should be examined and possibly modified prior to defining line styles. For details about the available options, see Source Options Window.
Recognized by
When you define a line style, you are specifying a recognition rule that the Data Extractor uses to identify any line of text in your file that matches that rule. Recognition rules are built in the Line Style Definition window. Each line style recognition rule is based on one of seven basic recognition styles. The available styles are described below.
Each style consists of an expression that specifies a search criterion. To see all the available options for building recognition rules, see Line Style Definition Window.
The following are some common, but brief, examples of where and how to use each of the seven basic Recognition styles:
Pattern
If you see that there is a unique string of text on one type of line that does not appear on any other lines, but always appears on that type of line, highlight that text and it becomes the pattern that the Data Extractor uses to identify that line in each record. In some cases, the unique string of text may even be a single character in a consistent position.
This is the most common style to use for a recognition rule. It offers the most flexibility, but can also be the most difficult to define. Patterns are defined by either single-row or multiple-row expressions. Recognition patterns are described in more detail later in this documentation.
Relative Position
If the line of text you are defining appears in the same relative position to some other line, a Base Line, you might define the recognition rule by its relative position to that other line. The Relative Position option allows you to specify one or more lines above or below the Base Line. The line of text you want to use as a Base Line must be defined before you define a line style that refers to it.
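Relative Position recognition can be sketched in a few lines of Python. This is an illustrative model only, not the product's code: the function relative_matches and the sample data are hypothetical, a positive count is assumed to mean lines below the Base Line, and a negative count lines above it.

```python
def relative_matches(styles_by_line, base_style, count):
    """Return indexes of lines that sit `count` lines from each Base Line."""
    hits = []
    for index, style in enumerate(styles_by_line):
        target = index + count
        if style == base_style and 0 <= target < len(styles_by_line):
            hits.append(target)
    return hits


report = [
    "1101-01",    # already recognized as the Base Line
    "",
    "Robertson",  # two lines below the Base Line
    "Linda",      # three lines below the Base Line
]
styles_by_line = ["AccountLine", None, None, None]

last_name_lines = relative_matches(styles_by_line, "AccountLine", 2)
first_name_lines = relative_matches(styles_by_line, "AccountLine", 3)
print([report[i] for i in last_name_lines])
print([report[i] for i in first_name_lines])
```

The sketch also shows why the Base Line must be defined first: the offsets are computed from lines that already carry the Base Line style.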
Exact Line Number
If the line of text you are defining appears on only one line of text in the report, you might define the recognition rule by its Exact Line Number. This option is very rarely used because it can only be used to recognize lines that occur only once in a report.
Blank Line
If you need to define the lines of the report that do not contain any text, you can define the recognition rule for them by Blank Line.
It is not necessary to define the blank lines in your report except when you need to use them as Base Lines, Accept lines, or markers of some kind.
All Undefined Lines
After you have defined all the necessary lines of text in your report, you may want to use this recognition rule to define all of the remaining undefined lines. Alternately, if all but a few of the lines of your report need to be recognized in the same way, you may want to define the lines you do not want, and then use All Undefined Lines to define the lines that you do want. Another reason you might want to define all undefined lines is for use in the Debug Extract Design Window.
The default line action for this option is COLLECT Fields, but you may select any Action that fits your needs.
Note: It is not necessary to define the undefined lines.
Pattern & Relative Position
Select this option when you want the Data Extractor to define a line style based on its relative position to another line AND by some specific characters or types of characters. Before selecting this option you must have already defined another line to use as your Base Line.
The recognition rule options for Pattern and for Relative Position, as well as special options for using both, are described in detail in this documentation.
The default line action for this option is COLLECT Fields, but you may select any Action that fits your needs.
Non-Blank Line
Select this option when you want the Data Extractor to use this line style to define all lines in the report that contain anything other than spaces and an end of line character.
There is no recognition rule for this option. Its behavior is automatic.
More About Line Styles
Once you have defined a line style and added it to the script, the line style name displays in the Line Style Column to the left of all lines that match that recognition rule.
If a line style name appears on any line that should not be included, select Edit Line Style and make the recognition rule more specific so the unwanted lines of text do not meet the recognition rule's criteria.
If a line style name does not appear on any line that should be included, select Edit Line Style and make the recognition rule broader so the needed lines of text match the recognition rule's criteria.
If the name of the line style you just defined does not appear on any line at all in the Line Style Column, then the recognition rule does not match any line in the report. Place the mouse in the Line Style Column of any blank line and double-click, or select Line > Edit Line Style. When the Line Style Definition window opens, select the line style name you want to modify from the Line Name list box, and edit the recognition rule so the expected lines of text meet the recognition rule's criteria.
Modify a Recognition Rule
If you need to modify the recognition rule of a line style, highlight any part of a line. Then select Line Style Column > Edit Line Style, or place the mouse pointer on a line style name in the Line Style Column and double-click. When the Line Style Definition window opens, modify the recognition rule and click Update.
How the Data Extractor Builds Recognition Patterns
The following sections describe how the Data Extractor builds recognition patterns.
New Line Style
When you select New Line Style from the menu, the Data Extractor creates a suggested line style recognition pattern that displays when the Line Style Definition window opens. For details, see Recognition Rules. If you highlighted a piece of the data on the line before making the selection, the Data Extractor uses the data you highlighted to form the recognition rule.
If the pattern meets your needs and the default line style name is acceptable, select a line action and then click Add to accept the recognition pattern. See Line Action. The Line Style Definition window closes unless you have turned Close Definition Dialogs on Add/Update OFF in the Preferences menu.
If you need to change the pattern or want to change the line style name, you may enter a new name or modify the pattern, then select a line action, and click Add. The Line Style Definition window closes unless you have turned the Close Definition Dialogs on Add/Update OFF in the Preferences menu.
Auto New Line Style
When you select Auto New Line Style from the menu, another menu prompts you to select a line action. See Line Action. After you select a line action, the Data Extractor creates a line style recognition pattern automatically.
The default line style name displays in the Line Style Column for each line of text in your report file that matches the recognition pattern.
If you need to modify the line style recognition pattern or the line style name, double-click the line style name in the Line Style Column to open the Line Style Definition window and edit the line style recognition pattern and/or name. After making the desired modifications, click Update. The Line Style Definition window closes unless you have turned Close Definition Dialogs on Add/Update OFF in the Preferences menu.
Suggested Approach - Defining Line Styles
After opening the report in the Data Extractor, scroll through your report file and choose a section of the file with which to work. Identify a section that is most representative of the entire file. Then open the Source Options window to specify the sample of the file with which you want to work. For details about how to set the sample size, see Source Options Window. You can move from one section of the report to another as you go along, if your report does not have any one section with all of the data you need to define.
Examine your text or report file before starting to define line styles to determine if some of the default options need to be changed in the Source Options window. For details about the available options, see Source Options Window. After making the necessary selections in the Source Options window, follow the steps below.
To define a line style:
- Highlight some selected text in the Data Panel that can be used as a line style recognition pattern.
- Right-click with the mouse positioned anywhere in the Data Panel (the large white area of the window).
- Select Define Line Style > New Line Style.
- From the Line Style Definition window, change the default line style name, if desired.
- Notice that the Data Extractor entered a suggested line style recognition pattern based on what you highlighted. Check to see if that pattern is acceptable.
- Click the Line Action tab and select a line action if you need an action other than COLLECT Fields.
- If everything is acceptable, click Add.
The Line Style Definition window closes unless you have turned Close Definition Dialogs on Add/Update OFF in the Preferences menu.
- When the Line Style Definition window closes, scroll through the report and verify that all of the matching lines in each section of the report now have been recognized with this line style.
- Edit the line style definition, if needed.
- Repeat the process until you have defined all the lines of text that contain the information you want to assemble into one data record.
In most cases, it is best if you do not define any lines of text from which you are not extracting data.
Tip: If you know that the pattern you have highlighted will work for this line style, select Auto New Line Style from the second level pop-up menu rather than simply New Line Style. This will cause a third level pop-up menu to appear with a list of line actions. Choose the line action you need for this type of line. The line style will be created without even opening the Line Style Definition window. You will see it appear in the Line Style Column next to every line with the highlighted pattern. For more information, see Understanding Line Style Behavior below.
Understanding Line Style Behavior
Each of the following sections describes methods you may use to highlight various types of data in your text or report file, and how the Data Extractor creates line style recognition rules based on what you highlighted.
By default, the Data Extractor bases each recognition rule on a pattern. To build that pattern, it searches the line of text for the following, in the order listed here.
- Exact highlighted text, if less than entire line
- If entire line is highlighted:
- Field Tags
- Special Characters
- The First Field on the line of text
These options are mutually exclusive: the Data Extractor uses the first one it finds and ignores the rest. If it finds no field tag, it looks for special characters; if it finds no special characters, it uses the first field. It never combines field tags, special characters, and the first field in a single pattern.
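The search order above can be sketched in Python as a chain of early returns. This is only an illustration of the fallback logic, not the Data Extractor's actual implementation: the function name `choose_pattern_source`, the returned labels, the colon tag separator, and the single-space field separator are all assumptions made for the example, and the field tag check is deliberately simplified.

```python
# Hypothetical sketch of the documented search order; not the product's code.
SPECIAL_CHARACTERS = "./-:"  # period, forward slash, dash, colon


def choose_pattern_source(line, highlighted=None,
                          tag_separator=":", field_separator=" "):
    """Return (kind, text) for the first pattern source found, or None."""
    # 1. Exact highlighted text, when less than the entire line is highlighted.
    if highlighted and highlighted != line:
        return ("highlighted text", highlighted)

    # 2. Field tag: up to three words to the left of the tag separator,
    #    starting after the last run of two spaces (or at the left margin).
    sep_index = line.find(tag_separator)
    if sep_index > 0:
        tag = line[:sep_index].split("  ")[-1].strip()
        if tag and len(tag.split()) <= 3:
            return ("field tag", tag)

    # 3. Special characters found anywhere on the line.
    found = [ch for ch in line if ch in SPECIAL_CHARACTERS]
    if found:
        return ("special characters", "".join(found))

    # 4. First field: the text before the first field separator.
    fields = line.strip().split(field_separator)
    if fields and fields[0]:
        return ("first field", fields[0])
    return None


print(choose_pattern_source("First Name: Mary"))
# ('field tag', 'First Name')
print(choose_pattern_source("TOTAL 42"))
# ('first field', 'TOTAL')
```

Because each step returns as soon as it succeeds, only one source is ever used for a given line, which mirrors the mutually exclusive behavior described above.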
Exact Highlighted Text
If you use the mouse to highlight one or more characters on a line of text, and select New Line Style or Auto New Line Style from the menu, the Data Extractor uses the highlighted text as the line style recognition pattern. This is usually a good way to begin defining a line style. Additions or modifications can be made in the Line Style Definition window. See Line Style Definition Window.
Highlight Selected Text in the Data Panel - Single Line
Highlight some selected text in the Data Panel, anything less than the total line. Zero in on the exact text, even down to one character, if necessary, that you want the Data Extractor to use as the recognition pattern. Select Define Line Style > Auto New Line Style, and then select a line action.
The Data Extractor uses the exact highlighted text as the recognition pattern for that line style and then assigns a unique line style name.
With the Append Line Pattern option at the second level pop-up menu, you can easily add to the recognition pattern. Simply highlight another piece of text that you want added to the recognition rules for that line and select Append Line Pattern from the menu. For details about Append Line Pattern, see Pop-up Menus.
Highlight an Entire Line of Text - Single Line
Highlight an entire line of text by clicking in the yellow Line Style Column to the left of the line. Select Define Line Style > Auto New Line Style, and then select a line action. The Data Extractor searches the entire line for:
- A field tag, based on which tag separator was selected in the Source Options window.
- Any of the special characters described under Special Characters.
- The first field in the line of text.
The Data Extractor uses whichever of these it finds first as the recognition pattern for that line style.
Field Tags
Many reports contain field tags, and the Data Extractor was developed to make use of these tags. Field tags may be used as line style recognition patterns or as field names for data fields.
Field tags are usually descriptive words that identify the data that follows the tag. In report files, there is usually a character that separates the field tag from its data, such as a colon or a hyphen. You may specify the tag separator on a line-by-line basis or for the whole report file in the Source Options Window.
A valid field tag must contain three words or fewer to the left of the tag separator. The Data Extractor finds field tags by searching everything to the left of the specified tag separator until it finds two spaces or the left margin of the report. When identifying field tags, the Data Extractor always uses a space to distinguish one word from the next.
Some examples of how the word "Name" may be used as a field tag:
- Name:John M. Smith (a colon is the separator)
- Name-John M. Smith (a hyphen/dash is the separator)
- Name  John M. Smith (two or more spaces are the separator)
Some examples of multi-word field tags with a colon and a space as the separator:
- Name of Business: ABCD Corporation
- First Name: Mary
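The field tag rules above (search left from the separator until two spaces or the left margin, and accept at most three space-separated words) can be sketched as follows. The function name `find_field_tag` and its `separator` parameter are hypothetical names for this illustration; the Data Extractor's own matching may differ in details such as how it treats spaces next to the separator.

```python
# Hypothetical sketch of the field tag rules; not the product's code.
def find_field_tag(line, separator=":"):
    """Return the field tag to the left of the separator, or None."""
    sep_index = line.find(separator)
    if sep_index <= 0:
        return None  # no separator, or nothing to its left

    left = line[:sep_index]
    # The tag begins after the last run of two spaces, or at the left margin.
    boundary = left.rfind("  ")
    tag = (left[boundary + 2:] if boundary != -1 else left).strip()

    # A valid tag has one to three words, separated by single spaces.
    if not tag or len(tag.split(" ")) > 3:
        return None
    return tag


print(find_field_tag("Name:John M. Smith"))                  # Name
print(find_field_tag("Name-John M. Smith", separator="-"))   # Name
print(find_field_tag("Name of Business: ABCD Corporation"))  # Name of Business
print(find_field_tag("Acct 1234  First Name: Mary"))         # First Name
```

In the last call, the two spaces after "Acct 1234" stop the leftward search, so only "First Name" is treated as the tag.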
Special Characters
For the purpose of extracting data out of text or report files, there are different types of printable information. For easier identification here, they are categorized into three groups, as follows:
- Letters (A-Z and a-z)
- Usually found in information such as a person's name, an address, a description of an item, etc.
- Digits (0 - 9)
- May be found in addresses, zip codes, phone numbers, currency amounts, dates, etc.
- Special characters
- Found in certain types of information that may be further categorized as specific formats of some kind, such as a date, a currency amount, or other specialized types of data. For this reason, the Data Extractor looks for these special characters when analyzing a line of text and tries to use them to build a recognition pattern.
- These are the special characters that the Data Extractor searches for and where they might be found:
- Period ( . ) - commonly found in numbers containing decimal places
- Forward Slash ( / ) - commonly found in dates
- Dash ( - ) - commonly found in dates, zip codes, and telephone numbers, but may also be a tag separator. For details on tag separators, see Source Options Window.
- Colon ( : ) - commonly found in time data, but often is used as a tag separator. For details on tag separators, see Source Options Window.
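As a rough illustration of why these characters are useful, the sketch below classifies characters into the three groups above and reduces a piece of text to the special characters it contains. The names `classify` and `special_signature` are invented for this example, and the "other" category for characters outside the three groups is an assumption.

```python
# Hypothetical sketch of the three character groups; not the product's code.
SPECIAL_CHARACTERS = {
    ".": "period",
    "/": "forward slash",
    "-": "dash",
    ":": "colon",
}


def classify(ch):
    """Place one character into letter, digit, special, or other."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch in SPECIAL_CHARACTERS:
        return "special"
    return "other"


def special_signature(text):
    """Return only the special characters in text, in their original order."""
    return "".join(ch for ch in text if ch in SPECIAL_CHARACTERS)


print(special_signature("09/15/95"))      # //
print(special_signature("$120.45"))       # .
print(special_signature("512-555-0100"))  # --
```

A date, a currency amount, and a telephone number each leave a different trace of special characters, which is what makes those characters good candidates for a recognition pattern.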
First Field
The first field is the first section of data on the line of text before the first field separator (a space by default, but this can be changed in the Source Options window). For more detail about how the Data Extractor automatically defines line styles, see the "How the Data Extractor Builds Recognition Patterns" section above. Each of the search methods listed above is explained in this topic.
Highlight Selected Text in the Data Panel - Multiple Lines
Highlighting a block of text in the Data Panel that includes more than one line of text is typically done with columnar reports, for detail lines within a report, or for lists of tagged data as in TUTOR1.REP. Examples of each can be found in the Data Extractor tutorials.
Other Types of Lines
As you define the line styles in your report, you may find there are some lines of text that require special handling. This section explains those special kinds of lines and how you might define them with the Data Extractor. These include the following:
- Blank Lines
- Header Lines
- Footer Lines
Blank Lines
The Data Extractor automatically and invisibly defines all blank lines in a text or report file and sets the line action to REJECT Line. No line style name displays in the Line Style Column. This keeps the Line Style Column less cluttered and improves the Data Extractor's performance. By identifying and rejecting blank lines, the Data Extractor reads the lines that contain information more quickly.
If you need to use a blank line as a reference line or Accept line, define it appropriately.
Header Lines
When header lines are present in your report file, one or more of those lines might contain information you want to extract and assemble as part of a data record. For example, there may be a report title, report date, or account number that you want as a data field in each record in your target file. Treat these lines as you would any other ordinary line that contains data that you need in your output file. First, define a line style for any header line that contains data you want to extract. The line action is usually going to be COLLECT Fields. Then, in each of those lines, define each data field you want to extract.
Footer Lines
When footer lines are present in your report file, one or more of those lines might contain information that you want to extract and assemble as part of a data record. The Data Extractor can extract data out of footer lines that follow columnar data.
Define the line styles of any previous lines in the report. Where you would normally select ACCEPT Record as the line action for the detail lines, select COLLECT Fields instead. When you define the data fields within the detail lines, go to the Data Collection/Output tab in the Data Field Definition window and turn Array Field ON for each field.
Proceed to the footer section and define the line or lines that contain the desired data. Select ACCEPT Record as the line action for the last defined line style. When you define the data fields on the footer lines, do not turn Array Field ON for these fields.
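One way to picture the footer technique is the sketch below, in which detail lines only collect values and the footer line completes the record. This is a conceptual model, not the Data Extractor's implementation: the tuple layout, the function name `assemble_records`, the use of lists for array fields, and the assumption that a non-array field simply holds its latest value are all invented for illustration. How the product lays out array values in the exported file is not covered here.

```python
# Hypothetical model of COLLECT Fields / ACCEPT Record with array fields.
def assemble_records(matched_lines):
    """matched_lines: iterable of (action, fields, is_array) tuples.

    action   -- "COLLECT" or "ACCEPT" (stand-ins for the line actions)
    fields   -- dict of field name to extracted value for that line
    is_array -- True when Array Field is ON for the fields on that line
    """
    records = []
    record = {}
    for action, fields, is_array in matched_lines:
        for name, value in fields.items():
            if is_array:
                # Array fields keep every value seen across detail lines.
                record.setdefault(name, []).append(value)
            else:
                # Assumed: ordinary fields hold a single value.
                record[name] = value
        if action == "ACCEPT":
            # The footer line closes the record and starts a new one.
            records.append(record)
            record = {}
    return records


report = [
    ("COLLECT", {"Item": "Widget", "Amount": "10.00"}, True),
    ("COLLECT", {"Item": "Gadget", "Amount": "5.50"}, True),
    ("ACCEPT", {"Total": "15.50"}, False),
]
print(assemble_records(report))
# [{'Item': ['Widget', 'Gadget'], 'Amount': ['10.00', '5.50'], 'Total': '15.50'}]
```

In this model, the detail lines cannot use ACCEPT Record because the record would be closed before the footer values arrive.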
Line Style Names
When you are defining line styles, the Data Extractor assigns line style names as follows:
- If selected text is highlighted in the Data Panel, the Data Extractor uses the alpha characters of the highlighted text as the default line style name. If there are no alpha characters in the section of text that is highlighted, the Data Extractor names the line Style1, Style2, etc.
- If an entire line is highlighted and the Data Extractor finds a valid field tag on the line from which to build the recognition pattern, the default line style name is the same as the tag excluding the separator.
- If an entire line is highlighted and no field tag is found, the default line style name is created at design time.
To process a web page, navigate to the desired page manually (entering any required passwords, etc.), and then use Save As to save the page to a local file. That file can be brought in and processed using the Data Extractor.
You may rename any line style in the Line Style Definition window, but the new name must meet the following requirements. Line style names must be unique and are limited to 20 characters. They may include upper and/or lower case letters (A - Z, a - z), numbers (0 - 9), and underscores ( _ ), but may not begin with a number. Line style names may not include spaces.
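The naming requirements can be expressed as a short check, sketched here in Python. The function name `is_valid_line_style_name` is hypothetical, and treating uniqueness as case-sensitive is an assumption, since the manual does not say how the Data Extractor compares names.

```python
import re

# 1 to 20 characters: letters, digits, and underscores, not starting with a digit.
NAME_RULE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,19}")


def is_valid_line_style_name(name, existing_names=()):
    """Check a proposed line style name against the documented rules."""
    if name in existing_names:  # assumed to be a case-sensitive comparison
        return False
    return NAME_RULE.fullmatch(name) is not None


print(is_valid_line_style_name("Phone_Bill"))  # True
print(is_valid_line_style_name("Phone Bill"))  # False (contains a space)
print(is_valid_line_style_name("1stLine"))     # False (begins with a number)
```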
Line style names do not appear in the output file in any way. They exist in the Data Extractor only to help you distinguish between the different lines for which you have defined recognition rules and line actions.
Line Action
The second important part of a line style is the line action.
When defining a line style, you must select a line action that determines how the Data Extractor processes the lines of text, or the data within the lines of text, that match the line style.
When you have made all the needed choices within the Line Style Definition Window, complete one of the following steps:
- Click Add if you are adding a new line style.
- Click Update after making modifications to an existing line style.
Defining Data Fields
Each data field that you define in the Data Extractor holds a particular piece of information that you want to extract from your report or text file. Each "set" of data fields that you define in the Data Extractor is exported as a record. The combination of the Line Style Definitions and Data Field Definitions determines how the data is extracted and assembled into data records.
This topic provides information about the various ways to define data fields using the Data Extractor.
Select the Define Data Field option after you have highlighted some text in the Data Panel and want to define one or more data fields. A second level menu appears. The information in each section below describes the menu options that are available in the second level menu.
If the data formats and/or techniques in this topic do not seem to fit your specific needs, please refer to Data Fields - Advanced Options.
Parse Columnar Data
Select the Parse Columnar Data option if you want the Data Extractor to analyze the highlighted text, define the line style automatically, and parse the line of text based on the selected column separator. For details about Column Separator options, see Source Options Window.
For example, if you have a line of text that looks something like the following, you might want the Data Extractor to parse it for you. First, select # of Spaces + as the column separator. Then select Parse Columnar Data from the menu.
| 09/15/95 | Phone Bill | Debit | $120.45 |
The Data Extractor creates a line style that looks for the decimal point in the last column and the two slashes in the first column, and names it Phone Bill. It automatically parses this line into four data fields.
This option is only available if the highlighted line of text contains the column separator specified in Source Options between some text within the line. If there is no obvious pattern for the Data Extractor to automatically create a line style, a message appears letting you know that the Data Extractor cannot create the line style for you. You can then create a line style yourself, and use the Parse Columnar Data menu option to automatically parse the data fields on that line.
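The parsing step can be approximated in Python by splitting a line wherever a run of spaces appears. This is a sketch under stated assumptions: it treats the # of Spaces + column separator as "two or more consecutive spaces", and the function name `parse_columns` and its `min_spaces` parameter are invented for the example rather than taken from the product.

```python
import re


def parse_columns(line, min_spaces=2):
    """Split a line into fields wherever min_spaces or more spaces occur."""
    separator = re.compile(r" {%d,}" % min_spaces)
    # Drop empty strings so leading or trailing spaces do not create fields.
    return [field for field in separator.split(line.strip()) if field]


sample = "09/15/95    Phone Bill    Debit    $120.45"
print(parse_columns(sample))
# ['09/15/95', 'Phone Bill', 'Debit', '$120.45']
```

Because a single space does not count as a separator here, "Phone Bill" stays together as one field, giving the four data fields described above.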
Parse Columnar w/Heading
Select this option if you want the Data Extractor to analyze a block of highlighted text, define the line style automatically, parse the lines based on the selected column separator, and allow you to specify the location of column headings whose names become the names of the data fields. For details about column separator options, see Source Options Window.
An example of columnar data with a heading: