Pervasive DataTools Online Help - Table of Contents

Pervasive DataTools User Guides

Table of Contents



Pervasive Software Inc.

12365 Riata Trace Parkway

Building B

Austin, TX 78727 USA

Web: http://pervasivedatatools.com



About These Manuals

These manuals lead you through the operation of the various Pervasive DataTools user interfaces. The DataTools are a set of subscription-based, licensed software products designed to assist you with a variety of data management, data conversion, data migration, and data transformation tasks. You must have a non-expired DataTools product license to run each of the applications.

Refer to the license.txt file in the default installation directory for disclaimers and information about trademarks and credits.

Table of Contents

The help topics are organized by the DataTools product categories; i.e., Loaders, Parsers, Extractors, etc. So whether you've already purchased a subscription or are still trying to determine how the tools work, peruse the help topics for any or all of the DataTools.

Each of the links below will take you to the Table of Contents for that product's complete online help.

Pervasive DataTools Launcher - Table of Contents

Pervasive DataTools

DataTools Launcher User’s Guide



Pervasive Software Inc.

12365 Riata Trace Parkway

Building B

Austin, TX 78727 USA

Web: http://pervasivedatatools.com



About This Manual

This manual leads you through the operation of the DataTools Launcher user interface. The Pervasive DataTools Launcher allows you to:

Refer to the license.txt file in the default installation directory for disclaimers and information about trademarks and credits.


Table of Contents


Getting Started with DataTools Launcher

The Active Tab

The Inactive Tab

Uninstall

Introduction to DataTools Launcher

DataTools Launcher is an application used to manage your Pervasive DataTools product licenses and to launch the variety of products available within the Pervasive DataTools suite.

Each DataTools license is a subscription to a specific product with a specific set of features.

DataTools Launcher consists of two tabs: one for managing active product licenses and one for viewing inactive licenses. Inactive licenses are either licenses that have expired or licenses that have been super-ceded. When you extend a product license, you receive a new license for the same product that has a new expiration date. This more recent, extended license supercedes the old one, and the old license file is archived and is moved to the Inactive tab.

Application Startup

DataTools Launcher looks for licenses within the Active License Directory. The default Active License Directory is located under the Windows shared application data folder. On Windows XP, the location is: C:\Documents and Settings\AllUsers\Application Data\Pervasive\DataTools\Common800\License. On Windows Vista (32-bit), the location is C:\ProgramData\Pervasive\DataTools\Common800\License. You can change the Active License Directory in the Preferences menu.

After installation and at first startup, the DataTools Launcher will prompt you to enter your user ID and password. Enter the same login credentials you used to order your license(s). The DataTools Launcher will then connect to the Pervasive license web service and download any new licenses or licenses missing from your Active License Directory. NOTE: It is important that you use the same user ID and password used to order your Pervasive DataTools licenses from the DataTools e-commerce web site.

After a successful log into the Pervasive web service, you will be notified if new or missing licenses exist. See below:

  • If you click No, DataTools Launcher will load and display all non-expired and non-superceded .slc files in the Active License Directory.
  • If you click Yes, you will be given the opportunity to add any of the new or missing license files to your Active License Directory.

Check the box in the Add cell next to the license you would like to copy to your Active License Directory. After selecting the licenses you wish to add to your Active License Directory, press the OK button. The licenses will be added and the Active tab will display.

Preferences

Under the preference menu, you have the ability to change the Active License Directory, the web service connection information, and the display font.

Changing the connection information includes changing the login information (User ID and Password) and changing some connection default behavior. Every time the application starts up, it will log into the Pervasive license web service to look for new or missing licenses. If you wish to turn this off, you can uncheck that preference. You can then select the Refresh command any time you want to look for new or missing licenses. You can also select to automatically download new or missing licenses any time they are found. The default is set to tell you how many were found and allow you to view and manually download the ones you want. Check the preference box to automatically download these if you want that behavior.

What Do You Want to Do?

For each task, we have provided a quick link to the section of the DataTools Launcher documentation that describes how to perform the specific task.

Task Help Topic
Find your DataTools licenses Go To Active Tab > Find
View licensed features
of an Active license
Go To Active Tab > View
Purchase additional licenses
or extend an existing license
Go To Active Tab > Purchase
Refresh the active license list Go To Active Tab > Refresh
Launch one of the DataTools
products
Go To Active Tab > Launch
View licensed features
of an Inactive license
Go To Inactive Tab > View
Uninstall DataTools Go To Uninstall DataTools

Introduction - Active Tab

The Active tab displays a list of the Pervasive DataTools licenses in your Active License Directory. DataTools licenses are files that have an .SLC file extension. First time users will be prompted for login information so that DataTools Launcher can download your active licenses from the Pervasive license web service. If the list is empty, it means no licenses exist under the login credentials provided. Use the Find command to locate and copy your license files into the Active License Directory if you have license files in other file storage locations.

To purchase additional licenses, use the Purchase command to access the e-commerce website where you can buy additional licenses or extend the subscription for a license that is about to expire.

The display grid on this tab and the Inactive tab shows some identifying information about each license.

Use the View command to view the license feature details.

All licenses that have expired will be automatically archived and display on the Expired tab. Licenses that are two weeks (or less) from expiration have a red expiration date.

Only one license within the display grid can be selected for further operations. Once a license is selected, you can use the Launch command to open the licensed product.

You can click any of the column headings within the grid to sort the grid by that column. The sorted column will display an up or down arrow depending on the sort sequence: Ascending or Descending.

View - Active Tab

Use the View command to open a selected license and view the licensed features. Any features that were licensed but have expired will have an expiration date in red. If the entire license has expired, all features that were licensed have an expiration date in red and the Licensing tab in the view is also red.

Find

Use the Find command to locate and add licenses to your Active License Directory. DataTools Launcher automatically logs into the Pervasive license web service at start up to download any new licenses, so the find command is for locating licenses you have on your local disk or network and copying these license files to your Active License Directory.

Adding licenses to the Active License Directory copies the license to the Active License Directory, leaving the original license in place. To search for licenses, type the path (including drive) you wish to search and press the Search button. You may cancel your search at any time if you already see the license you want to add in the found license list.

Select the licenses you wish to add, and click the Add button. As you add licenses the grid on the Active tab is updated to reflect these additions. Close the dialog when you are finished searching for and adding licenses.

Purchase

Use the Purchase command to access the Pervasive e-commerce site and purchase additional licenses or extend a license that is about to expire. After purchasing additional software, you’ll need to click the Refresh button to download your newly purchased or extended licenses.

See also: Refresh

Before the e-commerce site is launched, you will see the following message:

Note: No additional software installations are required after purchasing more licenses unless additional features have been added to the Pervasive DataTools product family since your last installation.

To extend a license, you simply go to the e-commerce site and purchase more time on a product you already own. You will be sent a new license file with the extended expiration date. You will need to download this new license into your Active License Directory. If you have 2 or more licenses for the same product in the Active License Directory, the one with the farthest out expiration date will be displayed and all others will be moved to the Inactive directory.

Refresh

Use the Refresh command to access the Pervasive license web service to check for new or missing license files. Your stored login userid and password will be used to see if new or missing license files exist. If any exist, you will be notified how many and be given the opportunity to add them to your Active License Directory.

If you click No, DataTools Launcher will load and display all non-expired .slc files in the Active License Directory. If you click Yes, you will be given the opportunity to add any of the new or missing files to your Active License Directory.

Check the box in the Add cell next to the license you would like to copy to your Active License Directory. After selecting the licenses you wish to add to your Active License Directory, press the OK button. The licenses will be added and the Active tab will display.

Launch

Use the Launch command to launch the graphical user interface application associated with the selected license. DataTools Launcher understands which application to launch based on the product name and serial number of the selected license. Each launched product has its own set of documentation. Besides documentation, there are samples and tutorials for learning how to use Data Tool applications. You are able to run only the applications licensed within the file selected within DataTools Launcher. If you are unsure as to which license you are using while in one of the launched applications, select Help > About within the launched application.

Note: It is recommended that you launch and work within only one application at a time. The suite of DataTools products uses a single copy of a configuration file (dj800.ini) needed for Pervasive products and engines to run. The license file being used is a part of this configuration file. Launching multiple applications could cause licensing errors.

Introduction - Inactive Tab

The Inactive tab displays a list of the Pervasive DataTools licenses that have expired or have been super-ceded. These licenses are in an archived state. DataTools Launcher archives expired and super-ceded licenses automatically at startup. Licenses in an archived state have an .archive file extension.

The display grid on this tab shows some identifying information about each license. Expired licenses have an expiration date in red. On this tab, you can choose to view license feature details. You can click any of the column headings within the grid to sort the grid by that column. The sorted column will display an up or down arrow depending on the sort sequence: ascending or descending.

View - Inactive Tab

Use the View command to open an archived license and view the licensed features. Any features that were licensed but have expired will have an expiration date in red. If the entire license has expired, all features that were licensed have an expiration date in red and the Licensing tab in the view is also red.

Uninstall DataTools

To uninstall Pervasive DataTools, open the Windows Control Panel. Go to Add/Remove Programs and double click on Pervasive DataTools to remove it.

Note: If you plan to upgrade your version of Pervasive DataTools, you must uninstall it first, then install the new version.

Pervasive Data Parser for Unstructured Text Online Help - Table of Contents

Pervasive DataTools

Data Parser for Unstructured Text User’s Guide



Pervasive Software Inc.

12365 Riata Trace Parkway

Building B

Austin, TX 78727 USA

Web: http://pervasivedatatools.com



About This Manual

This manual is currently a work in progress and therefore is incomplete. Documentation for the Data Parser for Unstructured Text is also available by clicking on the Help button on the right end of the button bar whenever the product is running.

This manual leads you through the operation of the Data Parser for Unstructured Text user interface. The Pervasive Data Parser for Unstructured Text allows you to extract useful data from report files and convert that data to a CSV Text file. You must have a non-expired Data Parser for Unstructured Text license to run this application.

Refer to the license.txt file in the default installation directory for disclaimers and information about trademarks and credits.

Table of Contents


Getting Started with Data Parser for Unstructured Text

Tutorials

Using Data Parser for Unstructured Text

All About Line Styles

All About Data Fields

Viewing the Extracted Data

Exporting the Extracted Data

Saving and Reusing Extract Scripts

Reference - User Interface

Appendix

Introduction to Data Parser for Unstructured Text

The Data Parser for Unstructured Text is a software product with the ability to read complex text files of many kinds. The amount of computer data grows vastly each year, and much of it is provided in raw text formats. Some examples of the many sources handled by the Data Parser for Unstructured Text follow:

  • Printouts from programs captured as disk files
  • Reports of any size or dimension
  • ASCII or any type of EBCDIC text files
  • Spooled print files
  • Fixed length sequential files
  • Complex multi-line files
  • Downloaded text files (e.g., news retrieval, financial, real estate)
  • HTML and other structured documents
  • Internet text downloads
  • Email header and body
  • Online textual databases
  • CD-ROM textbases
  • Files with tagged data fields
  • XML
  • HL7
  • Swift
  • And many others...

Using Data Parser for Unstructured Text, you can extract the desired data fields from various lines in the text file, and assemble those fields into a flat record of data. Thus, whole records of structured data can be extracted and presented in a conventional tabular (row and column) format that is needed before mapping and converting the data to a popular target format. Some of the features that make the Data Parser for Unstructured Text so complete are:

  • No practical limits on file size
  • Reads almost any kind of report architecture as long as there are rules
  • Support for large fields and records
  • Handles floating headers, footers and details
  • Can automatically detect and propose recognition patterns
  • Handles tagged data fields
  • Autoparses columnar and tagged data
  • Powerful debugging tools
  • Structured data browser to see results prior to export
  • Built on an extensible, extremely rich scripting language

The extraction of desired fields from the source text file is accomplished by visually marking up the file in the Data Parser for Unstructured Text user interface. The mouse is employed to select the desired fields from various lines displayed on the screen. Dialog boxes on the screen allow you to express a rich set of pattern recognition rules and actions to assist in the extraction of clean data.

Several techniques are available to view samples of extracted data. Apart from scrolling the full text of the data, a debug window can be used to search for all lines satisfying certain extraction criteria. For details, see **Debug Extract Design Window**. In addition, users can pop up a data browser that assembles all the fields and records in a grid format to give the user an idea of how the data will export. For details, see **Record Browser Window**.

Data Extraction Basics

The Data Extractor is a tool for extracting data that would otherwise be inaccessible.

Consider these scenarios:

  • Your company is attempting to migrate several years’ worth of data from a legacy application. The data files for this application are stored in an unknown proprietary format, possibly with compressed or encrypted fields. Although the data cannot be accessed directly, your legacy application can generate reports.
  • Your agency needs to merge data from several disparate sources into a single, easily accessible format. For example, you receive listings of real-estate properties from several different electronic sources that you want to combine into one standard listing format for your web site.
  • One of your clients needs to extract specific data from many large log files and aggregate that data into a database for statistical analysis.

In each of these scenarios, the Data Extractor can extract valuable data from standard formated text files with lots of irrelevant information, such as headers and comments.

The Data Extractor exports the extracted data to CSV (Comma Separated Values) Text file format. If you want to convert the data to another format, or you want to manipulate the data further after you have extracted it, the Data Loaders can accomplish this. The Data Loaders support over 100 different file types, allowing you to convert your data to the vast majority of databases used throughout the world.

To Use Data Extractor

First you need to have a report file. Most applications on nearly every type of platform give you the option of creating and printing reports. Have the program print the report in a text only format, either ASCII or any standard EBCDIC code page. For more information, see How to Create a Report File.

  1. Start a new script in the Data Extractor and select the report file.
  2. Look at the report in the Data Extractor.
  3. Notice the overall pattern of the report when it repeats, the page layout, and the style used to organize information. Locate the data that you want to extract.
  4. Input the structural information. The Data Extractor needs patterns and structural rules to identify important data.
    1. Define line styles by marking which lines have important information and how they can be recognized.
    2. Define data fields by marking the data that you want collected, and where it can be found.
    3. Specify line actions.
    4. While you are defining line styles and data fields, select options that specify how you want the data to be assembled into records and fields. The default action is to collect the fields. You must find the end of the first record, or the beginning of the second, and change the action for that line to Accept Record. This stops the collection process for the first record and begins the collection process for the second, thus setting exactly which fields are included in the eventual output for that record. If you want to define more than one type of record in a single report file, you can do that by defining more than one Accept Record line style.
    5. Assign the fields to each record type, according to how you want the data to be exported.
  5. Browse your data.
  6. Once you have entered all the information Data Extractor needs to find your data, and specified how you want it structured, the Data Extractor automatically builds that structure internally. You can open the data browser and see it in a grid. If the fields or records are not structured the way you want them, go back and adjust the data field and/or line style definitions.
  7. Finally, save the script.
  8. By saving your script, you can use it again if you need to extract data from a report with the same style in the future.

Additional details about each of these steps are described in this documentation.

Feature Segmentation

The following list presents some of the features available in the Data Extractor:

  • EBCDIC code page translation
  • Recognizes special characters and invisible characters
  • Multiple record Accepts
  • Mailing Label Template autoparse
  • Validates scripts automatically
  • View source data in external applications
  • Grid lines on Data Panel
  • Extract/Mine data from irregular text files
  • Auto New Line Style menu option
  • Auto New Data Field menu option

Some additional automatic menu options:

  • Parse Columnar Data
  • Parse Columnar Data w/ Heading
  • Parse on Field Separator
  • Parse Tagged Data
  • Parse Standards Data
  • Parse XML/HTML Data
  • Parse HL7 Data
  • Parse Swift Data
  • Parse LDIF Data
  • Parse EDI Data

About the Tutorials

Seven step-by-step tutorials are available to help you learn how to use the Data Extractor. We recommend that you complete the tutorials in the order in which they appear, as each tutorial builds upon the concepts covered in the previous tutorial.

Common Tasks

The Data Extractor tutorials all have several tasks in common. Those tasks are described here, and you may refer to them as needed.

  • Select Correct Tutorial File and Set Basic Options
  • Browse Data Records
  • Rearrange Data Fields
  • Save and Close Extract Design

Select Correct Tutorial File and Set Basic Options

Before starting each tutorial, you must select the matching tutorial file and set some basic options and file properties.

Use the following Data Extractor tutorial files with the tutorials:

  • Tutorial 1 - TUTOR1.REP
  • Tutorial 2 - TUTOR1.REP
  • Tutorial 3 - TUTOR3.REP
  • Tutorial 4 - TUTOR4.REP
  • Tutorial 5 - TUTOR5.REP
  • Tutorial 6 - TUTOR6.REP
  • Tutorial 7 - TUTOR7.TXT
To select a tutorial file and set basic options do the following:
  1. In the Data Extractor, click New Extract.
  2. In the Select the Text File window, navigate to the desired tutorial file in your default installation directory (Common800).
  3. Click Open. The report opens in the Data Extractor Data Panel.
  4. Open the Source Options window.
  5. In the Source Options window, select options that match the type and format of your text/report file.
  6. Close the Source Options window.
  7. Select Preferences from the menu and make sure "Close Definition Dialogs on Add/Update" is enabled.

Browse Data Records

Browse the data when you want to determine how your design choices have affected the data.

To browse the data records:
  1. Click Browse Data Record in the toolbar.
  2. If there is only one Accept record, a message window appears saying something similar to "Fields Assigned to Accept Record Category". If there are multiple Accept Records, you will be prompted to assign specific data fields to each Accept Record.
  3. Click OK.
  4. All of the Data Fields appear in the Data Browser window for you to preview your data in a tabular (row and column) format and to verify you have defined everything correctly. If you wish, you may rearrange the data fields.

Rearrange Data Fields

If the fields in the Data Browser window are not in the order you want them to appear in the export data file, change the export field order.

To rearrange data fields:
  1. Select Field > Export Field Layout from the menu.
  2. Click and drag a field name to the desired position. A special symbol displays while you are dragging.
  3. Reopen the Data Browser window to view your export fields in the order they will appear in the export data file.
  4. Once you are satisfied with the appearance of the data, save and close your extract script design.

Save and Close Extract Design

After you have completed your Extractor script, save and close it for later use.

To save your script and close Data Extractor:
  1. Save your script by clicking Save Extract in the toolbar. The Save Extract window appears.
  2. New extract scripts that have not been previously saved in Data Extractor display as "Extract: Extract1" in the title bar. Note: Notice that the extract file name defaults to your workspace; however, the extension has been changed to .cxl. This is consistent with standard naming for Data Extractor scripts. You can change the Extract File Name, but the extension must remain .cxl.
  3. Navigate to your default installation directory (Common 800) and name the extract script file (for example, Tutor1.cxl).
  4. Enter a description of the tutorial, if desired, and click OK.
  5. Click the Close Extract icon in the toolbar.
  6. Exit Data Extractor by selecting File > Exit.

The Data Extractor Tutorials

The following is a list and brief description of each of the Data Extractor tutorials.

Data Parser for Unstructured Text - Tutorial 1 - The Basics

Tutorial 1 guides you through the basic steps to create and save a script file in Data Parser for Unstructured Text. Later tutorials are more detailed.

This tutorial presents the fundamental concepts for using the Data Parser for Unstructured Text. It is recommended that you do this tutorial first. The example file is a tagged list, but the procedure is useful regardless of the type of report. The best way to use this tutorial is to print a hard copy so you can follow the sequential steps.

Tutorial Goals

In this tutorial, you will learn:

  • The basic process of creating an extract script
  • How to save the script design
  • New terms located throughout the documentation

Procedure

This tutorial is divided into three sections that should be completed in the order shown.

Define the Line Style - Accept Record

After selecting the tutorial file and setting up basic options, the first step in defining many extract scripts is to determine the line of data that marks the end of a record. In this case, the line with the string "Category:" is the last line of the first record.

After you identify the end of the record, define the line style for that line by marking the information that makes that line unique. In every record in this data file, the last line contains the string "Category:".

  1. Highlight the string "Category:" (including the colon following it).
  2. Right-click anywhere in the Data Panel, (the large white area of the screen) and select Define Line Style > New Line Style.
  3. The Line Style Definition window appears. Notice Data Parser has already formed line recognition rules based on the information you highlighted. It searches for all lines that contain the string "Category:" in columns 15 through 23.
  4. To indicate that Data Parser should accept the record at this point, ending one record and beginning the next, click the Line Action tab.
  5. Select ACCEPT Record.
  6. Click Add and proceed to Define the Line Style - Collect Fields.
  7. The line style name, Category, now appears in the Line Style Column (the yellow column on the left of your screen) to mark that line as matching the Category: Line Style pattern. A bold green arrow displays designating that this is the Accept Record line. Scroll down in the data panel and notice that each line that matches the pattern you defined was automatically marked with the "Category" Line Style.

Define the Line Style - Collect Fields

In the TUTOR1 file, the first line of text that contains pertinent data is the line with the report date "13-Jul-95" (10th line). The dashes (and their positions) in this line make it unique and are likely to remain consistent even if the date changes in later reports.

  1. Highlight the first dash.
  2. Right-click in the Data Panel and select Define Line Style > New Line Style.
  3. The Line Style Definition window appears. Notice that a pattern was created based on what you highlighted. Data Parser looks for any line that contains a dash in column 13.
  4. Type a more descriptive Line Style Name, such as "Report_Date".
  5. Click the Line Action tab and leave the option set to COLLECT Field Contents.
  6. The COLLECT Field Contents option causes any fields defined on this line to be included in the final output. COLLECT Field Contents is the action you want for the majority of the lines in this type of report.
  7. Click Add.
  8. Locate and highlight the string "Problem No:".
  9. Right-click in the Data Panel and select Define Line Style > New Line Style.
  10. The Line Style Definition window appears. Notice that Data Parser generated a Line Style recognition pattern based on the highlighted string "Problem No:". Data Parser also used the string "Problem_No:" to name the Line Style. You may rename the Line Style if you wish. Data Parser automatically selects COLLECT Field Contents as the line action. Since this is the option you want on most of the lines in this report, accept the default.
  11. Type a line style name.
  12. Click Add.
  13. Repeat steps 6 through 9 for each of the remaining lines in the first record.
  14. Remember, the "Category" Line Style has already been defined, and it is the Accept Record line.
  15. Proceed to Define Data Fields.

Define Data Fields

After defining line styles for 14 lines of the first record, define the Data Fields. You have given Data Parser the pattern information it needs to identify the lines in the report, now define what part of each line you consider to be useful data.

  1. Locate the line containing the date of the report.
  2. The line shows only the report date, so all of the text on that line is important.
  3. Highlight the entire date.
  4. The highlighted text is 1 row by 9 columns. The column and row numbers show at the bottom right part of the screen. Columns 11 through 19 on row 10 contain the date.
  5. Right-click in the Data Panel and select Define Data Field - New Data Field. The Field Definition window appears.
  6. Notice the Field Definition option is set to Fixed Column in both the Start Rule and End Rule tabs. The Data Field starts in column 11 and ends in column 19, exactly where you highlighted.
  7. The Field Name defaults to "Report_Date_1" indicating that this is the first field on the Report_Date line. Change the default name to a more descriptive name, Report_Date, by typing it in the Field Name box.
  8. Click Add.
  9. Define the remaining Data Fields:
    1. On the Problem_No line, highlight from column 25 to 30.
    2. This grabs enough space to include any larger numbers that might occur in later records.
    3. Right-click in the Data Panel and select Define Data Field > New Data Field.
    4. The default Field Name is Problem_No_1. Problem_No is a descriptive name, but there is only one field on this line so the "_1" is unnecessary.
    5. Click in the Field Name box and backspace twice to delete the number and underscore.
    6. Notice the Field Definition defaults to Fixed Column in both the Start Rule and End Rule tabs, starting in column 25 and ending in column 30.
    7. Click Add.
  10. Repeat step 6 for each remaining line of text on page 1 in TUTOR1.REP containing tagged data. See Table 3-2 below.
  11. Proceed to Browse Data Record in order to see how your data has changed.
  12. Rearrange Data Fields as needed.
  13. Save and Close your script.
Table 3-2: Tutorial 1 - Data Field Start and End Rules
Data Field Starting Column Ending Column
Report_Date 11 19
Problem_No 25 30
Techie 25 52
Status 25 52
MMDDYY 25 32
Time 25 32
Serial_No 25 39
Version 25 52
Customer_Name 25 52
Company_Name 25 52
Phone_No 25 52
Source_Type 25 52
Target_Type 25 52
Category 25 52

Data Parser for Unstructured Text - Tutorial 2 - Tagged Data and Automatic Features

Tutorial 2 guides you through the steps to create and save a script file using Data Parser’s automatic processes. The source file for this tutorial is the same tagged-list used in Data Parser for Unstructured Text Tutorial 1.

This tutorial introduces some of the useful timesaving features of Data Parser that read and flatten a data file that contains tagged data. It is useful to anyone ready to learn about more advanced Data Parser features. Tutorial 2 examines some quicker, more automatic ways to parse the same tagged-data used in Data Parser Tutorial 1.

Things to remember when defining Data Fields and Line Styles in tagged data:
  • When Data Parser automatically creates Data Fields, it uses the positions you have highlighted to determine the length of the Data Field. Be sure and allocate enough space for data in subsequent records that are wider than the text you are currently selecting. For example, the Techie Name in the first record is "John". In a subsequent record it could be "Alexander Graham Bell IV".
  • For tagged data, everything in the selection to the left of the Tag Separator is the Field Tag and everything to the right of the Tag Separator is the Data Field.
  • When a Line Style is created, it is not just for the line you are working on but also for any line that matches the Line Style definition. This means that when you create a Line Style that looks for "Techie:" in columns 17 to 24, and there is a Data Field defined for that Line Style in columns 26 to 55, all lines that have "Techie:" in columns 17 to 24 have a Data Field in columns 26 to 55.

Tutorial Goals

In this tutorial, you will learn:

  • How to create an extract script using automatic processes
  • How to save the extract design as a script file
  • New terms used throughout the Data Parser documentation

Procedure

These steps should be completed in the order shown.

Define Data Fields

After selecting the tutorial file and setting up basic options, the first step in defining most extract scripts is to determine the line of data that marks the end of a record. In the TUTOR1 data file, the line of text that contains "Category:" marks the end of each record.

  1. Highlight the line that contains the string "Category:", up to column 45. Check the indicator in the lower right corner of the screen for column locations.
  2. Right-click anywhere in the Data Panel (the large white area of the screen) and select Define Data Field > Parse Tagged Data.
  3. Note: Data Parser automatically defines a Line Style with the string "Category:" in columns 15 through 23 as the recognition pattern, and a Line Action of Collect Fields, and names it "Category". It also creates a Data Field that collects any data on that line beginning at column 24, one space after the colon, and going to column 45, and names it "Category". The field is now defined, and the text turns red on the screen.
  4. If you wish to check the Data Field definition, you can double-click on the field itself (the red text) and the Field Definition window opens. Make any necessary changes, then click Update.
  5. Proceed to Define the Line Style - Accept Record.

Define the Line Style - Accept Record

Since the Category Line Style is the last line of the record, the Line Action should be Accept Record. When Data Parser creates a line style automatically, it makes the line style Collect Fields, so the line action needs to be changed.

  1. Double-click on the Line Style Name, "Category" in this case, in the Line Style Column, the yellow column on the left part of your screen. The Line Style Definition window appears.
  2. Click on the Line Action tab and select ACCEPT Record [Including] This Line’s Fields from the list of choices.
  3. Click Update.
  4. View the data record by clicking on the Browse Data Record button in the button bar.
  5. Proceed to Adjust Data Field Definition.

Adjust Data Field Definition

  1. Select the entire Problem No line by left clicking on that line in the Line Style Column (the left yellow column).
  2. Right-click in the Data Panel (the large white part on the right) and select Define Data Field > Parse Tagged Data.
  3. The Line Style pattern that Data Parser automatically creates looks for Problem No: in positions 13 through 23.
  4. Double-click on Problem_No if you want to check it.
  5. Click Close to close the Line Style Definition window.
  6. To display the Field Definition window to view the information for the Problem_No: Data Field that was automatically generated, double-click anywhere in the Data Field where the text is red.
  7. Click the End Rule tab. Notice that the end rule is 52.
  8. This is larger than the Problem No: Data Field needs to be, because it is defining the size of the Data Field all the way to the right margin of the report.
  9. Change the end rule of the Problem_No: field to 30.
  10. Click Update.
  11. Notice the selected area on the Data panel for the Problem_No: Data Field is much smaller after the update.
  12. Proceed to Define the Header Information.

Define the Header Information

For this exercise, assume that the first line of the report contains information you want.

  1. Highlight the report name WINTECH on line 8 in positions 11 through 17.
  2. Right-click in the Data panel, and select Define Data Field > New Data Field. The Field Definition window appears.
  3. The default Data Field Name is highlighted. Since there is no tag on this line, Data Parser used the data itself as the Line Style name and Data Field name. Change the field name to ReportName by typing it in the Field Name box.
  4. Click Add.
  5. To define the report date Data Field, repeat steps 1 through 4, except highlight from columns 11 to 19 and name the field ReportDate.
  6. Proceeed to Update Line Style.

Update Line Style

The purpose of this exercise is to update the automatically generated "Jul95" Line Style to make it more generic for different report dates.

  1. To edit the "Jul95" Line Style, double-click on Jul95 in the Line Style Column.
  2. The Line Style Definition window diplays. Notice that the Pattern for this Line Style looks for 13-Jul-95 in columns 11 to 19.
  3. Size the cells in the grid to view the information better, by following these steps:
    1. Position the mouse over the line in the header row of the grid where the column headings are. The mouse pointer becomes a bold vertical bar with arrows pointing to the left and right.
    2. Hold down the mouse button and drag the edge of the column to the left or right.
    3. Release the mouse button when the column is the desired size.
    4. If desired, adjust the height the same way using the gray border to the left where the triangle and asterisk are located.
  4. To change the pattern to look for a line with any date with the dd-mmm-yy format, click once in the Look For? cell on the first row of the grid where 13-Jul-95 is currently displayed.
  5. A down arrow appears on the right side of that cell.
  6. Click on that arrow and the Pattern Builder window appears.
  7. TAB to the Value cell, delete the original value, and type a dash (-).
  8. Change the values of both the Begin and End cells to 13 by tabbing to them and typing in the correct number.
  9. Click OK.
  10. Notice that the Look For?, Begin, and End values have changed in the Line Style Definition window to reflect the changes made in the Pattern Builder window.
  11. Add a new row to the Line Style Definition grid by clicking in the And/Or cell in the second row. Accept the value default of And.
  12. Click in the Search What? cell of the second row and click the down arrow.
  13. Select Column Range (m-n) from the displayed list.
  14. Select Contains from the list displayed in the Operator cell of the second row.
  15. Click on the arrow in the Look For? cell of the second row to display the Pattern Builder window again.
  16. to the Value cell, delete the original value, and type in a dash (-).
  17. Change the Begin and End values to 17.
  18. Be careful to only enter a dash in the Value cell and do not leave any spaces around it.
  19. Click OK.
  20. The line style definition should now match any line with a dash in position 13 and 17.
  21. Click Update to save the changes to the ReportDate Line Style.
  22. Proceed to Define Remaining Data Fields and Line Styles.

Define Remaining Data Fields and Line Styles

In this exercise, you will Define Data Fields and Line Styles for the Techie, Status, MM/DD/YY, Time, Ser #, Version, Customer Name, Company Name, Phone #, Source Type, and Target Type Tagged Data Fields.

  1. Highlight the Field Tag, the Tag Separator, and the data by dragging the mouse with the left mouse button depressed from the beginning of the Tag to the end of the Data Field.
  2. Remember to extend out to the right to catch wider data in subsequent records.
  3. Right-click in the Data Panel and select Define Data Field > Parse Tagged Data.
  4. Data Parser creates a Line Style Definition and a Data Field Definition for you. OR
  5. Click the line in the Line Style column to select it.
  6. Select Parse Tagged Data.
  7. Open the Field Definition window.
  8. Adjust settings.
  9. Click Update and Close.
  10. Note:Data Parser named the MM/DD/YY, Ser #, and Phone # Data Fields and corresponding Line Styles MMDDYY, Ser, and Phone. Also, Data Fields with embedded spaces are named with the spaces removed. This was done because Field Names can only contain letters, digits and underscores. Scroll down in the Data panel and see how the rest of the data is being defined.

  11. Browse the data records to see how your data has changed.
  12. If desired, rearrange the data fields as needed to meet your export file requirements.
  13. Save and close your script.

Tip: This file can be parsed even more automatically. If you wish to try it, follow these steps:

  1. Click the Clear Line Styles icon in the button bar.
  2. Highlight all the tagged data lines in the entire first record, beginning with the Problem No line and highlighting all the way down and including the Category line.
  3. Be sure to catch all the field tags and data plus some extra space to the right.
  4. Right-click in the data panel and select Define Data Field > Parse Tagged Data.
  5. The Data Parser creates several new line styles and data fields at once. This method only works in cases of highly structured and consistent data. And it can be a great time saver when conditions are ideal.

Data Parser for Unstructured Text - Tutorial 3 - Columnar Data

Tutorial 3 guides you through the steps to create and save a script file in Data Parser for Unstructured Text that reads and flattens a report containing columnar data.

In Tutorial 3, you convert the data in a columnar report file, to a flattened format, using the more automatic features of Data Parser.

This tutorial introduces more of the time-saving features of Extract Schema Designer. Since a great many report formats contain columnar data of some kind, it is highly useful to anyone who wants to use Data Parser.

By following the steps outlined below, you become familiar with both the process of creating an extract script and the terms used throughout the documentation.

Unlike the previous tutorials, this file has multiple Accept Record line styles in a single page of the report. The primary data record information is in the table detail lines. Each line is essentially a record. Each of these is an Accept Record line.

Tutorial Goals

In this tutorial, you will learn:

  • How to create a script that reads and flattens a report with columnar data
  • How to use more automatic features of Data Parser
  • How to save the script file

Procedure

The following steps should be completed in the order shown:

Define Line Styles and Data Fields for Detail Lines

After selecting the tutorial file and setting up basic options, define line styles and data fields. Data Parser does the following when you complete this task:

  • Divides the line into seven Data Fields using spaces as a column separator. The Data Fields are given default field names SALES/ MARKETING_1 through SALES/MARKETING_7.
  • Creates a Line Style for the line. The Line Style that is automatically created has a default Line Name of SALESMARKETING. It identifies all lines in the report that have the string SALES/MARKETING in positions 1 through 16.
To define the line styles and data fields for detail lines:
  1. Select the first detail line (it begins with SALES/MARKETING) by clicking in the Line Style column (the narrow yellow stripe on the left) immediately to the left of the line.
  2. This highlights the entire line of text.
  3. Right-click in the Line Style Column (the yellow part of the screen on the left) and select Parse Columnar Data.
  4. From the menu, select Preferences and click once on Close Definition Dialogs on Add/Update to disable the option.
  5. To view the definitions of the Data Fields created, double-click on the colored sections of the line.
  6. For example, double-clicking on the green numbers 75,249 in the Data Panel brings up the SALES/MARKETING_2 Data Field in the Field Definition window. SALES/MARKETING_2 is the default Data Field name given to the second Data Field in the SALESMARKETING line. It starts in position 20 and ends in position 27. Since it is defined for the Line Style SALESMARKETING, only lines that match that recognition pattern contain this Data Field in positions 20 through 27.
  7. Proceed to Change Data Field Names.

Change Data Field Names

The Browse Data Record uses the Data Field names as column headings for the Data Fields, so it is a good idea to change the Data Field names for SALES/MARKETING_1 through SALES/MARKETING_7 to more descriptive field names.

To change Data Field names:
  1. Double-click on one of the Data Fields in the SALESMARKETING line to open the Field Definition window.
  2. In the Field Definition window, highlight the default Field Name and replace it with a corresponding descriptive name.
  3. See table 3-3 below.
  4. Click Update.
  5. To select the next Data Field, click the Field Name arrow to display a drop-down list of Data Fields that have been defined for the current Line Style.
  6. Select the next Data Field and continue until you have renamed all the fields. Close the Field Definition window when finished.
  7. Proceed to Change Line Style Name and Definition.
Table 3-3: Tutorial 3 - Suggested Data Field Names
Default Name Suggested Name
SALES/MARKETING_1 Department
SALES/MARKETING_2 Team1
SALES/MARKETING_3 Team2
SALES/MARKETING_4 Team3
SALES/MARKETING_5 Team4
SALES/MARKETING_6 Team5
SALES/MARKETING_7 DepartmentTotal

Change Line Style Name and Definition

To view the new Line Style SALESMARKETING, double-click on the name SALESMARKETING in the Line Style column (the yellow column on the left of your screen). The Line Style Definition window appears.

Notice the SALESMARKETING Line Style is recognized by a pattern where columns 1 to 16 contain the string SALES/MARKETING.

To change the Line Style Name and Line Action:
  1. In the Line Style Definition window, change the Line Style Name by highlighting SALESMARKETING in the Line Style Name box and replacing it with Detail.
  2. Also in the Line Style Definition window, click the Line Action tab and select the ACCEPT Record Including option.
  3. Click Update.
  4. Proceed to Define Line Recognition Rules.

Define Line Recognition Rules

The Detail Line Style only matches lines that have SALES/MARKETING in columns 2 through 16. That is the recognition pattern that Data Parser created automatically, but it is not the pattern that is needed in this case. The pattern needs to be general enough to match all of the detail lines in the text, but specific enough to match ONLY the detail lines. Update the Line Pattern so that the Line Style match all of the detail lines excluding the TEAM TOTALS line.

Analyze the detail lines to find what makes them unique in comparison to other lines in the text. Things to look for are position of the Data Fields, contents of the Data Fields, anything that is consistent for each of the detail lines but not contained in non-detail lines. For example, the detail lines contain:

  • Commas in positions 24, 34 and 75 on every line
  • Only letters, white space, and a "/" in columns 2 through 18
  • Only digits, white space, and commas in columns 20 through 79
  • A digit in position 78
  • An upper case letter in each of the first 5 positions

Of all of the above observations, creating a pattern to look for uppercase letters in the first five positions is the best way to go. Here are some reasons why:

  • Defining a pattern that would check for commas in positions 24, 34, and 75 requires three pattern lines and probably would not match every detail line in subsequent reports.
  • Suppose in this same report (created a week later) Team 2 of the Development department went to a pre-paid weeklong class and they only spent 100 dollars on supplies for the class. This means that a comma would not be in position 34 of that detail line so it would not match the Line Style, and the essential data on that line would be lost.
  • Defining a pattern to check for letters, white space, and a "/" in columns 2 through 18 would require three pattern lines and would also match the column heading line.
  • Defining a pattern to match lines that contains at least one digit in positions 20 through 79 and do not contain letters or "/" would require three pattern lines and it would match the detail lines. However, it also matches the Team Totals line.
  • Defining a pattern to match lines that contain a digit in position 79 would match the detail lines and the Team Totals line.
To define a pattern that looks for upper case letters in positions 2 through 6:
  1. Click the Line Recognition Rules tab in the Line Style Definition window.
  2. Click once in the Look For? cell in the first row of the grid and click the down arrow.
  3. The Pattern Builder window appears.
  4. In the Pattern Builder window, click in the Type cell and click the down arrow to display the allowable values for the Type field.
  5. Select character class from the list.
  6. This tells Extract Schema Designer what kind of data it needs to match for that line style to be valid.
  7. Tab to the Value cell and click the arrow to display the allowable values for the Value field.
  8. Select upper case letters from the list.
  9. This tells Data Parser the specific data it needs to match for that line style to be valid.
  10. Change the value in the Count cell to 5 by highlighting the value there and typing a 5.
  11. Change the value of the Begin field to 2 and the End field to 6.
  12. This tells Data Parser where to look for the data you specified and how many of that particular data must be found for the line style to match that line.
  13. Click OK.
  14. Click Update to save the modified line style definition.
  15. Proceed to Define Data Fields.

Define Data Fields

In this part of the exercise, you will define the rest of the data in the record, starting with the report title.

To define the ReportTitle Data Field:
  1. Select the report title ABC CORPORATION BUDGET on line 1 by highlighting it in the Data Panel.
  2. Right-click in the Data Panel and select Define Data Field > New Data Field.
  3. The Field Definition window appears.
  4. Change the default name to ReportTitle.
  5. Click Add.
  6. Data Parser takes the selected text and Data Field name to automatically define a Data Field named ReportTitle and a line style as well, named ABC_CORPORATION_BUDG.
  7. Click Close.
  8. Double-click on ABC_CORPORATION_BUDG in the Line Style Column to display the Line Style Definition window.
  9. Notice that Data Parser automatically creates a recognition pattern that looks for the literal ABC CORPORATION BUDGET in positions 27 through 48.
  10. In the LineStyleName field, type ReportTitle.
  11. Click Update and Close.
  12. Proceed to Define Line Styles.

Define Line Styles

  1. Select the report date 10/26/95 on line 2 by highlighting the text with the mouse.
  2. Right-click in the Data Panel and select Define Data Field > New Data Field.
  3. The Field Definition window appears.
  4. Change the default name to ReportDate.
  5. Click Add and then Close.
  6. Data Parser takes the selected text and enters Data Field name and automatically define a Data Field and a Line Style.
  7. Double-click Style1 in the Line Style Column to display the Line Style Definition window.
  8. Notice that Data Parser automatically created a recognition pattern that looks for the literal "/" in positions 35 and 38.
  9. Rename the Line Style to ReportDate.
  10. Click Update and Close.
  11. Browse the data Records to see how your data has changed.
  12. If desired, rearrange the data fields as needed to meet your export file layout requirements.
  13. Save and close your script.

Data Parser for Unstructured Text - Tutorial 4 - Floating Tags

Tutorial 4 guides you through the steps to create and save a script in Data Parser for Unstructured Text that reads and flattens a data file that containing floating tag data in a variable-length ASCII report.

This tutorial is useful to anyone likely to be working with floating tag data. By following the steps outlined below, you become familiar with both the process of creating an extract script and the terms used throughout the documentation.

Tutorial Goals

In this tutorial, you will learn:

  • How to create a script that reads and flattens an ASCII report with floating tag data
  • How to save the script file
  • New terms located throughout the documentation

Procedure

The steps in this tutorial should be completed in the order shown.

Define Line Styles

After selecting the tutorial file and setting up basic options, find the patterns in this file and build recognition patterns (Line Style Definitions) so that Data Parser can identify the lines with data.

The first characteristic of this report to notice is that each data record uses two lines of text. Another important characteristic is that several characters on each line are repeated consistently in the same position. These consistent patterns make it easy for you to build Line Style Definitions.

Data Parser automatically creates a Line Style using ATTDOC in columns 19 through 24 as the Recognition Pattern and ATTDOC as the Line Style Name when you complete this task. Each line of text that matches this Line Style now displays the Line Style Name ATTDOC and a green arrow in the Line Style Column to the left of the text line.

To define Line Styles:
  1. Highlight the letters TRN in columns 15 through 17.
  2. The letters TRN are in the same position in the first line of every record in the report. We could use the slash ( / ) in the third column or the colon ( : ) in the ninth column or any of several other consistent characters to identify the line, but the TRN is fine. Data Parser needs only one consistent characteristic to identify a line.
  3. Right-click in the Data Panel and select Define Line Style > Auto New Line Style > Action - Collect Fields.
  4. In the second line of text, highlight the string ATTDOC in columns 19 through 24.
  5. These six letters appear in the same position in each second line of every record in the report.
  6. Right-click in the Data Panel, and select Define Line Style > Auto New Line Style > Action-Accept Record since this is the last line of every record.
  7. Proceed to Define Data Fields.

Define Data Fields

Notice that only the first couple of Data Fields in each of the TRN lines falls within the same columns from record to record. Define these fields first:

  1. In the TRN line, highlight the logged date and time data from columns 1 through 13.
  2. Right-click in the Data Panel, and select Define Data Field > New Data Field.
  3. In the Field Definition window, overwrite the default by typing Log in the Field Name box.
  4. Click Add.
  5. Highlight the string TRN from columns 15 through 17.
  6. Right-click and select Define Data Field .. New Data Field.
  7. The Field Definition window appears.
  8. Overwrite the default field name by typing Trans_Type in the box.
  9. Click Add.
  10. The TRN text changes to green all through the report indicating that it is the second data field defined on that line.
  11. Highlight the 12-digit number from columns 19 through 30.
  12. Right-click and select Define Data Field > New Data Field.
  13. The Field Definition window appears.
  14. In the Field Definition window, type Trans_No in the Field Name box.
  15. Click Add.
  16. The numeric text changes to blue in each of the TRN lines within the report indicating that it is the third field defined on that line.
  17. Proceed to Change Vertical Positioning.

Change Vertical Positioning

Because the patient and doctor names are different lengths, you cannot use Fixed Position to define the remainder of the Data Fields on the TRN lines. But because all of the fields other than the names have field labels with colons and spaces, Field Tags, you can define those fields as Floating Tag. "Floating" means that the Field Tags are not in the same position on the line in every record. If there were no Field Tags, you could still define the fields using Relative Word Position.

The fourth field starts in the same column in each of the TRN lines so you can define Start Rule as Fixed Position for this field.

To change the Vertical Positioning Bar:
  1. Click Vertical Positioning Bar.
  2. Click at the beginning of the field to confirm that the field does indeed start in the same position in every record.
  3. Click Vertical Position Bar again to remove the red line.
  4. The End Rule is Floating Tag because TIM:, the tag for the next field, always occurs at the end of this field.
  5. Proceed to Set Floating Tags - First Line of Text.

Set Floating Tags - First Line of Text

  1. Highlight the patient’s name from columns 32 through 47.
  2. Right-click and select Define Data Field > New Data Field.
  3. The Field Definition window appears.
  4. Type Patient in the Field Name box.
  5. Click the End Rule tab.
  6. Click on the Floating Tag option.
  7. Notice that the cursor is now blinking in the box to the right of the option.
  8. Type TIM: in the box.
  9. This tells Data Parser that this Data Field ends when the TIM: Field Tag is encountered.
  10. To prevent truncation, click the End Rule tab and set the Default FldLength to 30 bytes.
  11. Click Add.
  12. Notice that the patient’s name does not change to colored text in the report. Fields defined as Floating Tag or Relative Word Position do not appear in colored text, nor are they underlined even if you have Underline Fields enabled in the Preferences menu. This is because those field positions are not the same in all records.
  13. Highlight the date and time data from columns 54 through 71.
  14. Right-click and select Define Data Field > New Data Field.
  15. Type Date_Time in the Field Name box.
  16. At the Start Rule tab select the Floating Tag option.
  17. Type TIM: in the box.
  18. This tells Data Parser that this Data Field starts immediately after the string TIM:.
  19. Click the End Rule tab and select the Floating Tag option.
  20. Type TYP: in the box.
  21. This tells Data Parser that this Data Field ends when the TYP: Field Tag is encountered.
  22. Click Add.
  23. Repeat the task for all except the last field (RATE).
  24. Proceed to Set End of Line - First Line of Text.

Set End of Line - First Line of Text

The RATE field at the end of the TRN line starts with a Floating Tag, but ends at the end of the line of text. Define this Data Field accordingly.

  1. Highlight the rate data from columns 93 through 97.
  2. Right-click and select Define Data Field > New Data Field.
  3. Type Rate in the Field Name box.
  4. On the Start Rule tab, select the Floating Tag option. Type RATE: in the box.
  5. Click the End Rule tab and click the End of Line option.
  6. Click the Data Collection/Output tab and set the Default FldLength in bytes for the field.
  7. Click Add.
  8. Proceed to Set Floating Tags - Second Line of Text.

Set Floating Tags - Second Line of Text

Look at the ATTDOC line of text in the records. Notice that the Data Fields in this line are also Floating Tag data. Follow these steps to define all the Data Fields except the last field.

  1. Highlight the attending doctor number from columns 29 through 34.
  2. Right-click and select Define Data Field .. New Data Field.
  3. Type Attdoc_No in the Field Name box.
  4. On the Start Rule tab, select the Floating Tag option. Type ATTDOC NO: in the box.
  5. Click the End Rule tab.
  6. Click the Floating Tag option. Type ATTDOC: in the box.
  7. Click the Data Collection/Output tab and set the Default FldLength in bytes for the field.
  8. Click Add.
  9. Repeat the steps (using the appropriate field names and tags) for the remainder of the Data Fields on the ATTDOC line, except the last field (BY).
  10. Proceed to Set End of Line - Second Line of Text.

Set End of Line - Second Line of Text

The BY field at the end of the ATTDOC line starts with a Floating Tag, but ends at the end of the line of text just like the RATE field in the first line. So, use the same steps as before, except use End of Line as the End Rule for that field.

  1. Highlight the field.
  2. Right-click and select Define Data Field .. New Data Field.
  3. Name the field.
  4. Set the Start Rule.
  5. Click the End Rule tab and click the End of Line option.
  6. Click the Data Collection/Output tab and det the Default FldLength in bytes for the field.
  7. Click Add.
  8. Browse the data records to see how your data has changed.
  9. If desired, rearrange the data fields as needed to meet the requirements of the export data file.
  10. Save and close your script.

Data Parser for Unstructured Text - Tutorial 5 - Columnar Data with a Footer

Tutorial 5 guides you through the steps to create and save a script file in Data Parser for Unstructured Text that reads and flattens a data file containing both detail lines and a footer line with data to extract.

This tutorial is useful to anyone likely to be working with columnar data with footer. Before doing this tutorial, it is recommended that you do Data Parser Tutorial 3 - Columnar Data first.

By following the steps outlined below, you become familiar with both the process of creating an extract script and the terms used throughout the documentation.

Tutorial Goals

In this tutorial, you will learn:

  • How to create a script that reads and flattens a data file with detail and footer lines
  • How to save the script file
  • New terms located throughout the documentation

Procedure

The steps in this tutorial should be completed in the order shown.

The primary data record information is in the table detail lines. This data is highly structured in neat consistent columns. Data Parser can build recognition patterns for Line Styles and Data Fields with this type of data automatically, saving you a lot of time and effort.

Define Line Styles and Data Fields

After selecting the tutorial file and setting up basic options, define line styles and data fields.Data Parser automatically creates a Line Style for the line and gives it a default Line Name of SALESMARKETING when you complete this task. Data Parser also automatically parses the line into 7 Data Fields using spaces as a column separator. The Data Fields are given default names of SALESMARKETING_1 through SALESMARKETING_7.

To define line styles and data fields:
  1. Select the first detail line (it begins with SALES/MARKETING) by clicking in the Line Style column immediately to the left of the line to highlight the entire line of text.
  2. Right-click in the Line Style Column (the yellow stripe on the left part of the screen), and select Parse Columnar Data.
  3. Proceed to Change Data Field Names.

Change Data Field Names

Since the Data Field names are used in the Browse Data Record as column headings for the Data Fields, change the Data Field names for SALES/MARKETING_1 through SALES/MARKETING_7 to more descriptive field names.

See the new, and more descriptive, names for the Data Fields in Table 3-4 below.

Table 3-4 Tutorial 5 - Suggested Data Field Names
Default Name Suggested Name
SALES/MARKETING_1 Department
SALES/MARKETING_2 Team1
SALES/MARKETING_3 Team2
SALES/MARKETING_4 Team3
SALES/MARKETING_5 Team4
SALES/MARKETING_6 Team5
SALES/MARKETING_7 DepartmentTotal

To change Data Field names:
  1. In the Preferences menu, disable Close Definition Dialogs on Add/Update by unchecking it.
  2. Double-click on one of the Data Fields in the SALESMARKETING line to open the Field Definition window.
  3. In the Field Definition window, select the default field name, highlight it, and replace it with the corresponding descriptive name given above.
  4. Click Update.
  5. To select the next Data Field, click the Field Name arrow and a list of Data Fields that have been defined for the current Line Style is displayed. Select the next Data Field.
  6. Name the remaining Data Fields until you have named all the fields.
  7. Click Close.
  8. Double-click on the name, SALESMARKETING, in the Line Style column on the left of the screen.
  9. The Line Style Definition window appears.
  10. Type in a new name, Detail.
  11. Click Update.
  12. Proceed to Define Recognition Patterns.

Define Recognition Patterns

The SALESMARKETING Line Style is recognized by a pattern where columns 2 to 16 contain the text SALES/MARKETING. This pattern matches only the first detail line. It needs be general enough to match all of the detail lines in the text, but specific enough to match only the detail lines, not the TEAM TOTALS line.

Analyze the detail lines to find what makes them unique in comparison to other lines in the text. Things to look for are position of the Data Fields, contents of the Data Fields, anything that is consistent for each of the detail lines but not contained in non-detail lines. For example, the detail lines contain:

  • Commas in positions 24, 34 and 75 on every line.
  • Only letters, white space, and a / in columns 2 through 18.
  • Only digits, white space, and commas in columns 20 through 79.
  • A digit in position 78.
  • An upper case letter in each of the first 5 positions.

Of all of the above observations, creating a pattern to look for uppercase letters in the first 5 positions is the best way to go. Here are some reasons why:

  • Defining a pattern that checks for commas in positions 24, 34, and 75 would require 3 pattern lines and probably would not match every detail line in subsequent reports. Suppose in this same report (created a week later) Team 2 of the Development department went to a pre-paid weeklong class and they only spent 100 dollars on supplies for the class. This means that a comma would not be in position 34 of that detail line so it would not match the Line Style.
  • Defining a pattern to check for letters, white space, and a / in columns 2 through 18 would require three pattern lines and also matches the column heading line.
  • Defining a pattern to match lines that contains at least one digit in positions 20 through 79 and does not contain letters or / would require 3 pattern lines. It also matches the Team Totals line.
  • Defining a pattern to match lines that contain a digit in position 79 would match the detail lines and the Team Totals line.

So, the best pattern to use is one that looks for capital letters in columns 2 through 6.

To define a recognition pattern:
  1. In the Line Style Definition window, click once in the Look For? cell in the first row of the grid, then click the arrow to display the Pattern Builder window.
  2. Change the value of the Type field from literal to character class by clicking in the Type cell, then clicking the arrow to display the allowable values for the Type field. Select character class from the list.
  3. Click in the Value cell, then click the arrow to display the allowable values for the Value field. Select upper case letters from the list.
  4. Highlight the value in the Count field and change it to 5. The value of the Begin field should be 2. Change the value of the End field to 6 and click OK.
  5. Click in the empty cell in the seecond row of the And/Or column. The string And automatically displays in that cell.
  6. Click in the first empty cell in the Search What? Column. Then click on the down arrow and select Column Range (m-n) from the list.
  7. Click in the first empty cell in the Operator column. Then click on the down arrow and select Does Not Contain from the list.
  8. Click in the first empty cell in the Look For? column. Then click on the down arrow. This opens the Pattern Builder window.
  9. If the Value column is not empty, delete the contents of that cell. Press or place the mouse pointer in the cell in the Value column and click once to position the blinking cursor in that cell. Type a capital P.
  10. Change the values in the Count, Begin and End cells to 1.
  11. Click OK in the Pattern Builder window.
  12. Click the Update and Close in the Line Style Definition window.
  13. Notice that Detail now appears beside each of the detail lines in the Line Style column, and not next to the Processing Date line.
  14. Proceed to Modify Data Fields.

Modify Data Fields

To modify the Data Fields in the Detail lines so the Data Parser can extract the data on the last line of the report:
  1. Double-click in the Department Data Field, the red text at the beginning of each detail line.
  2. The Field Definition window opens.
  3. Click on the Data Collection/Output tab, and click on the Array Field option to enable it.
  4. Click Update.
  5. Click the Data Field Name arrow and choose the next Data Field.
  6. Repeat this process for each field in any one line of text.
  7. Click Close.
  8. Proceed to Define Line Style.

Define Line Style

To define the PROCESSING DATE line:
  1. Highlight PROCESSING DATE:.
  2. Right-click in the Data Panel and select Define Line Style > Auto New Line Style > Accept Record.
  3. Data Parser defines the Line Style using the first word of the highlighted text in that position as the recognition pattern and named the Line Style PROCESSING.
  4. Proceed to Define Data Field.

Define Data Field

To define the Data Field on the PROCESSING_DATE line:
  1. Highlight the date from columns 17 through 24.
  2. Right-click in the Data Panel and select Define Data Field > New Data Field.
  3. When the Field Definition window opens, change the default Field Name to Date.
  4. Click Add in the Field Definition window.
  5. Browse the data Records to see how your data has changed.
  6. Rearrange the data fields as needed to meet the requirements of your export data file.
  7. Save and close your script.

Data Parser for Unstructured Text - Tutorial 6 - Variable Length Multi-Line Data Fields

Tutorial 6 guides you through the steps to create a script that reads and flattens a data file containing data that extends across multiple lines of text and where the end of each record varies.

This tutorial is useful to anyone who has a report with fields that extend across more than one line, or has no consistent end of record line.

By following the steps outlined below, you become familiar with both the process of creating an extract script and the terms used throughout the documentation.

Tutorial Goals

In this tutorial, you will learn:

  • How to create a script that reads and flattens a data file with varied record lengths
  • How to save the script file
  • New terms located throughout the documentation

Procedure

The steps in this tutorial should be completed in the order shown

Scroll through the data to get an idea of this file’s structure. Notice that there are eight or nine left-aligned Field Tags in each record. These tags can be used to easily identify and define the Line Styles in this report.

Define Line Styles

After selecting the tutorial file and setting up basic options, define your line styles.

  1. In the third line, highlight DATE: from columns 1 through column 5.
  2. Right-click with the mouse positioned anywhere in the Data Panel (the white part of the screen) and select Define Line Style > Auto New Line Style > Action-Collect fields.
  3. Data Parser automatically defines the Line Style with a Recognition Pattern of DATE: in columns 1 through 5 and name the Line Style DATE.
  4. Repeat the same basic procedure in step 2 for each of the following Field Tags in the first record. See Table 3-5 below.
  5. Table 3-5 Tutorial 6-Field Tag Columns
    Field Tag Beginning
    Column
    Ending
    Column
    RECORDATION 1 12
    CONSIDERATION 1 14
    SITE DIMENSIONS 1 16
    SITE AREA 1 10
    ZONING 1 7
    REMARKS 1 8
    Note: You may also, if you wish, go to the second record and define the LEGAL DESCRIPTION: field in columns 1-18.
  6. In the 23rd line of the report use the mouse to highlight UNIT PRICE: from column 1 through column 11.
  7. Right-click in the Data Panel and select Define Line Style > Auto New Line Style > Action-Accept Record.
  8. Data Parser automatically defines the Line Style with a Recognition Pattern of UNIT PRICE: in columns 1 through 11 and name the Line Style UNIT_PRICE. To verify that the Line Style Definitions match the appropriate lines of text throughout the report, scroll down and see that each of the lines that contain a Field Tag has the corresponding Line Style Name in the Line Style Column to the left of the text line.
  9. Proceed to Define Data Fields.

Define Data Fields

  1. Highlight November 12, 1993 from columns 26 through column 42.
  2. Right-click and select Define Data Field > New Data Field. The Field Definition window appears.
  3. The Field Name defaults to DATE_1.
  4. Type DATE to overwrite the default or click in the Field Name box and backspace over the _1.
  5. Click Add.
  6. Highlight Jeff County from columns 26 through 36.
  7. Right-click and select Define Data Field > New Data Field.
  8. The Field Name defaults to RECORDATION_1.
  9. Type in something else if you wish or click with the mouse and backspace twice to remove _1.
  10. Click Add.
  11. Proceed to Set Continuation Rule.

Set Continuation Rule

Notice that some of the data you want to extract resides within a single line of text in one record but continues across multiple lines of text in other records. For example, the data in the CONSIDERATION field in the first record is on a single line of text, but in the second record, the data in the CONSIDERATION field continues across nine lines of text. This is easily defined using the Data Parser feature called Continuation Rule.

  1. Highlight $333,000 Cash from columns 26 through 38.
  2. Right-click and select Define Data Field > New Data Field.
  3. Change the field name, if you wish.
  4. Click the End Rule tab and select the End of Line option.
  5. Click the Continuation Rule tab and select the Until Next Line Style option.
  6. There is one extra step necessary for fields that are not fixed in length, which is setting the Default FldLength to prevent data truncation.
  7. Set the Default FldLength to 500 bytes on either the End Rule tab or the Data Collection/Output tab.
  8. Click Add.
  9. Ensure that Data Parser is picking up all the data by clicking Browse Data Record, and widening the CONSIDERATION column.
  10. Define all of the remaining Data Fields.
  11. Browse the data records to see how your data has changed.
  12. Rearrange the data fields as needed to meet the requirements of the export data file.
  13. Save and close your script.

Data Parser for Unstructured Text - Tutorial 7 - Multiple Accept Records

Tutorial 7 guides you through the steps to create and save an extract script file in Data Parser that uses multiple Accept Records.

You parse the data in a report file, TUTOR7 (supplied during installation), into a format suitable for exporting. By following the steps outlined below, you become familiar with both the process of creating an extract script and the terms used throughout the documentation.

Tutorial Goals

In this tutorial, you will learn:

  • How to create a script that parses a report file into a multiple record file
  • How to use multiple Accept Records in your script
  • How to save the script file
  • Terms used throughout the documentation

Procedure

The steps in this tutorial should be completed in the order shown.

Define a Line Style and Data Fields

After selecting the tutorial file and setting up basic options, begin creating line styles for the first Accept Record.

To create a Line Style and Data Fields for these lines:
  1. Select the first detail line (Parmer Lane Animal Hospital) by clicking in the Line Style column immediately to the left of the line. This highlights the entire line of text.
  2. Right-click anywhere in the Line Style column and select New Line Style.
  3. The Line Style Name defaults to Parmer_Lane_Animal_H.
  4. Rename it HospitalLine.
  5. Click Add.
  6. Highlight Parmer Lane Animal Hospital in the Data Panel.
  7. Right-click in the Data Panel and select Define Data Field > New Data Field.
  8. The Field Name defaults to HospitalLine_1.
  9. Change the name to Hospital.
  10. Click Add.
  11. Right-click in the Line Style column to the left of April 1, 1999, and select New Line Style.
  12. Change the Line Style Name to ReportDateLine and click Add.
  13. The data on this line is always centered beneath the HospitalLine. Depending on what month and day the report is run on, the data field may be longer or shorter than the current date.
  14. To make sure that your data field is wide enough, highlight the data from positions 31 through 57.
  15. Right-click in the Data Panel and select Define Data Field > New Data Field.
  16. Change the field name to ReportDate.
  17. Click the Data Collection/Output tab.
  18. Make sure that the Trim Leading and Trailing Spaces box under Other Collection Options is checked.
  19. Click Add.
  20. The first repeating Line Style that we want Extract Schema Designer to find is the Account Number.
  21. Click in the Line Style column to the left of 1101-01, then right-click anywhere in the Line Style column and select New Line Style.
  22. The Line Style Name defaults to STYLE1. Change it to AccountLine.
  23. Proceed to Define Line Recognition Rules.

Define Line Recognition Rules

Notice the entry under the Look For? column. Its default is in position 5. While this catches all pertinent lines in our example, it might not catch all instances in a larger record example.

To update the Line Recognition Rules:
  1. Click in the first cell under Look For?, then click the arrow.
  2. In the Pattern Builder window, click in the first cell under Type and select Mask from the list.
  3. Click in the first cell under Value. Keep the hyphen and type a pound sign (#) for each numeral, for example ####-##.
  4. This tells Data Parser that there are four digits followed by a hyphen, and then two more digits.
  5. Change the Begin position from 5 to 1.
  6. Change the End position from 5 to 7.
  7. Click OK.
  8. Click Add.
  9. Highlight the account number (1101-01) and right-click in the Data Panel.
  10. Select Define Data Field > New Data Field.
  11. A Field Definition window appears.
  12. Change the Field Name from AccountLine_1 to AccountNo.
  13. Click Add.
  14. Before you continue, select Source Options from the tool bar and select the Flush Field Contents on Accept Default box under the Script Design Choices tab.
  15. This flushes the data from the remaining fields in your report, unless you manually change a specific field to propagate the data.
  16. Click OK.
  17. Proceed to Define Line Styles and Data Fields.

Define Line Styles and Data Fields

  1. Select the first detail line under the first account number.
  2. Right-click in the Line Style column to the left of Robertson and select New Line Style.
  3. A Line Style Definition window appears.
  4. Rename the Line Style from Robertson to LastNameLine.
  5. Click the Recognized By arrow and select Relative Position.
  6. Note: The information under Line Recognition Rules changes. The default Base Line is AccountLine.
  7. If it is not, click in the first cell under Base Line and select Account Line from the drop-down list.
  8. The default Line Count from Account Line is 1. However, there is a blank line between the AccountLine and the LastNameLine.
  9. Change the count from 1 to 2.
  10. Click Add.
  11. Highlight Robertson and continue out to position 35, in case someone further in the file has a very long last name.
  12. Right-click in the Data Panel and select Define Data Field > New Data Field.
  13. A Field Definition window appears.
  14. Change LastNameLine_1 to LastName and click Add.
  15. Right-click in the Line Style column to the left of Linda and select New Line Style.
  16. A Line Style Definition window appears.
  17. Change the Line Style name from Linda to FirstNameLine.
  18. Click the Recognize By arrow and select Relative Position.
  19. The Line Recognition Rules should default to three lines from AccountLine.
  20. If they do not, change the Count and the Base Line accordingly.
  21. Click Add.
  22. Now highlight Linda and continue out to position 35, in case someone further in the file has a very long first name.
  23. Right-click in the Data Panel and select Define Data Field > New Data Field.
  24. A Field Definition window appears.
  25. Change FirstNameLine_1 to FirstName.
  26. Click Add.
  27. Right-click in the Line Style column to the left of 143 Patterson Place and select New Line Style.
  28. A Data Parser Line Style Definition window appears.
  29. Change the Line Style name to Address1Line.
  30. Click the Recognize By arrow and select Relative Position.
  31. The Line Recognition Rules should default to 4 lines from AccountLine.
  32. If not, change the Count and the Base Line accordingly and click Add.
  33. Highlight 143 Patterson Place and continue out to position 60, in case someone further in the file has a very long address.
  34. Right-click in the Data Panel and select Define Data Field > New Data Field.
  35. A Field Definition window appears.
  36. Change Address1Line_1 to Address1 and click Add.
  37. Right-click in the Line Style column to the left of Austin TX 78759 and select New Line Style.
  38. In the Line Style Definition window, change the Line Style name to CSZ (for City, State, Zip).
  39. Click the Recognize By arrow and select Relative Position.
  40. The Line Recognition Rules should default to 5 lines from AccountLine.
  41. If not, change the Count and the Base Line accordingly and click Add.
  42. Proceed to Define the Line Style - Accept Record.

Define the Line Style - Accept Record

Since this is the end of the first record (before other information becomes a subset of this information), make this line an Accept Record.

  1. Click on the Line Action tab and select Accept Record Including and click Add.
  2. Right-click the Line Style column again and select Parse Columnar Data.
  3. Double-click the red Austin field.
  4. Rename the Field Name from CSZ_1 to City.
  5. Click Update.
  6. Double-click on the green TX field.
  7. In the Data Parser Field Definition window, rename the Field Name from CSZ_2 to State.
  8. Since this field is always be a two-character field, click on the End Rule tab and change the Fixed Column number to ZZ.
  9. Click Update.
  10. The State field now has only the two letters underlined in green.
  11. Double-click on the blue 78759 field.
  12. In the Field Definition window, rename the Field Name from CSZ_3 to Zip.
  13. Since this field is always start at position 40, change the Start Rule Fixed Column accordingly.
  14. Since this field may go out to position 49, change the End Rule Fixed Column accordingly.
  15. If you do not extend this field, the last four digits in the zip for the second record is not picked up.
  16. Click Update.
  17. View the data collected and rearrange data fields as needed.
  18. Proceed to Define Line Style for Pet Information.

Define Line Style for Pet Information

You must define a line style that recognizes each of the pet information lines. Analyze each of these records, to look for a common pattern to define the Line Style.

The Pet Type, Pet Name, Sex, and Color are all different in each record. The only thing that remains the same is the placement of Age, in positions 42 and 43.

To define your Line Style to recognize this pattern:
  1. Click in the Line Style column to the left of DSH-C.
  2. Right-click anywhere in the Line Style column and select New Line Style from the pop-up list.
  3. A Line Style Definition window appears.
  4. Change the Line Style Name from DSHC to PetInfoLine.
  5. Click in the first cell under Look For?.
  6. Click on the down arrow that appears to the right of that cell.
  7. The Pattern Builder window appears.
  8. Since you do not want to look for a specific character, change the Type from literal to character class.
  9. Click in the first cell under Value and select digits from the drop-down list.
  10. Since you can not be sure whether there are one or two digits for the Age, change the Count from 5 to 1.
  11. Change the Begin position from 3 to 42, the first possible Age position.
  12. Change the End position from 7 to 43, the last possible Age position and click OK.
  13. Since the PetInfoLine may have visit information under it, it is be an Accept Record as well.
  14. Click on the Line Action tab and select ACCEPT Record Including.
  15. Click Add. Then scroll through the file to see if Data Parser now recognizes all of the PetInfoLines.
  16. You should see a green arrow to the left of each of lines.
  17. With the PetInfoLine still highlighted, right-click in the Line Style column again and select Parse Columnar Data.
  18. Since the data fields default to PetInfoLine_1 >through PetInfoLine-5, go up to Preferences and de-select Close Definition Dialogs on Add/Update.
  19. Rename each of the fields more appropriately by clicking the Field Name arrow, and selecting each new data field.
  20. If you make a change to a data field, click Update before moving on to the next data field. Use the following new names: PetBreed, PetName, PetAge, PetsSex, and PetsColon. Note that the PetAge field does not extend far enough to catch the second digit of the age.
  21. Open the data field definition window and change the Start Rule to 42, and the End Rule to 43.
  22. Click Update.
  23. The field is now properly aligned.
  24. Similarly, the PetsSex field is always a single character. So open the data field definition window and change the End Rule to 58.
  25. Click Update.
  26. The field is now limited to a single character field.
  27. Open the data field definition window and change the End Rule to 90.
  28. Click Close.
  29. From the Preferences menu, enable Close Definition Dialogs on Add/Update.
  30. Since there may be office visit records for each pet, propagate the pet name so that it can be used with the visit date.
  31. Proceed to Change the Accept Record Behavior.

Change the Accept Record Behavior

To change the behavior on Accept:
  1. Double-click on Shiva to open the Field Definition window.
  2. Click the Data Collection/Output tab.
  3. Click Propagate Field Contents, then click Update.
  4. To view the data collected for this Accept Record, browse the data record.
  5. To select the Current Accept Record, click PetInfoLine in the middle of the screen.
  6. Click Assign to Current Accept Record.
  7. Click to the left of 02/18/99 on line 20 to define the VisitInfo line style.
  8. Right-click anywhere in the Line Style column and select New Line Style.
  9. Change the default NA Line Style Name to VisitInfoLine.
  10. Click on the Line Action tab and select ACCEPT Record Including.
  11. Click Add.
  12. With the VisitInfoLine still highlighted, right-click in the Line Style column again and select Parse Columnar Data.
  13. Since the data fields default to VisitInfoLine_1 through PetInfoLine-3, go to Preferences and disable Close Definition Dialogs on Add/Update.
  14. Double-click the first data field in the VisitInfoLine.
  15. Rename each of the fields appropriately, by clicking the Field Name arrow, and selecting each new data field.
  16. If you make a change to a data field, click Update before moving on to the next data field. Use the following new names: VisitDate, Diagnosis, and Service.
  17. Since the VisitDate field is always a fixed length, change the End Rule to 15.
  18. Change the End Rule on the Service field to 105 to be sure that it collects all the information in that field.
  19. Click Update and then Close.
  20. From Preferences, re-check Close Definition Dialogs on Add/Update.
  21. Assign the account number to each pet name, and associate the pet name and account number with each office visit by selecting Record from the menu, and selecting Edit Accept Record.
  22. Select PetInfoLine as the Current Accept Record. Then click Show Fields in the upper right portion of the Accept Record Definition window.
  23. Click the AccountLine check box.
  24. Click Update.
  25. Browse the data to see that the account number now shows up at the bottom of the PetInfoLine fields.
  26. Click Record in the top toolbar again, and select Edit Accept Record.
  27. Select VisitInfoLine as the Current Accept Record.
  28. Click Show Fields in the upper right portion of the Accept Record Definition window.
  29. Since you want to add the account number to this Accept Record, click the AccountLine check box.
  30. Add the pet’s name to this Accept Record by clicking the PetName check box.
  31. Click Update.
  32. Click on the VisitInfoLine in the middle of the screen to select the Current Accept Record.
  33. Click Assign to Current Accept Record.
  34. Scroll through the data to see that the visit information shows up under the pet name.
  35. Browse the data records to see how your data has changed.
  36. Rearrange the data fields as needed to meet the requirements of your export data file.
  37. Save and close your script.

Introduction to Basic Elements

When working with report or text files in the Data Extractor, there are three basic elements to help you accomplish your task. Those basic elements are line style, data field, and field content.

A line style is what you define that tells the Data Extractor how to identify a particular line of text in the report. You want to define each line style in such a way that the Data Extractor can identify that same line of text throughout the report. The trick to defining a good line style is to make the recognition rules specific enough to not include any lines you do not want recognized, yet broad enough to include all lines you do want recognized. See Defining Line Styles.

A data field is what you define that tells the Data Extractor which specific portions of the report you want to extract and assemble into data records. There are options in the Data Extractor that let you determine which data fields are collected and assembled as part of each output data record. For details, see Define Data Fields.

The field content is the data that occupies each defined data field. Any defined data field may contain data or may be blank in any given record. There are options in the Data Extractor that let you determine whether or not the field contents of one data field are carried forward to subsequent data fields that are blank. For details on the Flush Field Contents, Propagate Field Contents, and Flush Field Contents on Accept Default options, see Define Data Fields and Source Options Window.

When using Data Extractor to extract data from a text or report file, you follow some general procedures. It is important to understand that the goal is to define line styles and data fields in such a manner that the Data Extractor is able to assemble records of data out of the information contained in the report.

Some Helpful Tips

These sections offer some helpful hints and tips that may make your task easier. Since every report or text file is different, these subjects are more general in nature. There are more specific examples offered in other sections of the documentation.

Finding Logical Record Breaks

Your goal is to extract useful data out of your report or text file and then assemble that data into a field-and-record-oriented format. Therefore, one of your initial steps should be to examine the report and find the logical record breaks. When you locate a logical record break, you define a line of text as the ACCEPT Record to mark where the Data Extractor should stop collecting data fields and assemble a data record.

Some types of reports are formatted in such a way that logical record breaks are easy to locate and the ACCEPT Record is easy to define. Some examples follow:

  • Reports where each page contains the information that comprises one record, and the last line of text is defined as the ACCEPT Record. See Extract Schema Tutorial 1 - The Basics and Extract Schema Tutorial 2 - Tagged Data and Automatic Features.
  • Columnar reports where each line of text comprises one record, and each line is defined as the ACCEPT Record. See Extract Schema Tutorial 3 - Columnar Data.
  • Variable-length ASCII files where each record is derived from a consistent number of lines of text, and the last line of each record is defined as the ACCEPT Record. See Extract Schema Tutorial 4 - Floating Tags.

Other types of reports are not so easy. Some examples follow:

  • Reports that contain detail lines and a footer with data to be extracted. See Extract Schema Tutorial 5 - Columnar Data with a Footer.
  • Reports that contain data that extends across multiple lines of text within one data field. See Extract Schema Tutorial 6 - Variable Length Multi Line Data Fields.
  • Reports that contain data that fits into more than on logical record type. See Extract Schema Tutorial 7 - Multiple Accept Records.

When your report is formatted in such a way that defining the end of a record is difficult, sometimes the only way to handle this is to define the beginning of a record. You can use the ACCEPT Record option that tells the Data Extractor to assemble a record, but collect the data fields Before Collecting this line’s fields. This ends the field collection action on the last defined line of text that falls BEFORE the ACCEPT Record line, and places any data fields on the ACCEPT Record line in the next record.

Basic Steps

These are the basic steps required for creating an Extract script.

  1. Open a text file, report file, or URL file with the Data Extractor.
  2. Define line styles for each line in source file that contains information to be extracted.
  3. Within each defined line, you may define one or more data fields.
  4. After defining all the needed line styles and data fields, save the script.
  5. Export the extracted data.

The following sections explain the details of these procedures.

  • How to Create a Report File
  • Defining Line Styles
  • Defining Data Fields
  • Saving an Extract Script

Open a Text File or URI

To Create a New Extract

Follow these steps to open a text file, report file, or URI in the Data Extractor.

  1. Click the New Extract icon image\newscr.gifin the toolbar, or select New Extract from the File menu.
  2. In the Select the Text File window, choose your source file in one of three ways:
    1. Type the drive, directory path, and filename directly in the Source File text box and click OK.
    2. Click the arrow and browse to build the path and file name for the source file.
    3. Type the complete URI addressing scheme (e.g., http://www.yahoo.com/) in the Source File text box.

    When the file or URI is selected, the text displays in the Data Panel section of the Data Extractor main window.

    All file types that the Data Extractor can open appear in the Files of Type box in the Select Report/Text File window. Because the Data Extractor supports so many different types, not all extensions are visible within the Files of Type box. However, all available file types appear in the selection window, including those files with extensions not viewable in the Files of Type box.

    Note: A minimum of v1.4.0 of Sun Java Runtime must be installed for the URI support to work. Without this component, you will receive the error message: "Unable to load Java virtual machine."

    Note: Extract scripts (.CXL files) derived from a URI source connection are now stored in the default workspace directory called Extracts. For more information, see URI Support.

  3. Scroll through your file and locate a page or record that best represents the perfect record; i.e., one that is most representative of all the records. In the Source Options window, set the Sample Size to include this portion of the file. (See Source Options Window.)

Also, study the overall formatting of the information you want to extract, looking for any tagged or columnar data in the report. Pay special attention to the tag separators, field separators, and column separators. Again, in the Source Options Window, make appropriate selections from the list of available options.

You are now ready to begin defining line styles and data fields.

If you need to quit the Data Extractor and come back to your work at another time, save the extract. This also saves your work in a database file called extractor800.mdb in your \Common800 directory.

To open an existing extract with the same report:

If you have already defined some line styles and data fields in a report and saved that work, you can open the extract and report in another session and continue your work or make modifications to it.

If you have created and saved a complete extract, you can open the saved Extract and run an export.

To open an existing extract and report, double-click it with the mouse in the Extract Manager, or select it then select File > Open Extract.

Note: Extract scripts (.CXL files) derived from a URI source connection are stored in the default workspace directory called Extracts.

To open an existing extract with a different report:

You can open a previously-designed script with a different report or text file to check on script compatibility.

  1. Open the extract normally, either by double clicking on it, clicking Open Extract, or by selecting File > Open Extract.
  2. Select Source > Options.
  3. Click the File Properties tab.
  4. Click the Text File arrow and browse to choose the new report.

Extract Tuning Tips

Because the Data Extractor reads and evaluates each data line, anytime you have many Line Styles (usually more than five), consider these tips to speed up the performance of your extract scripts.

  • Change the order in which your Line Styles are checked. Position the most frequently found Line Styles at the top of your Line Style list. This way, "hits" that occur early on mark these lines and save the Data Extractor tons of comparison time. To change the Line Style order, select Line > ReOrder Line Styles.
  • Note: Do not alter the Line Style order when it is used to control which of multiple Line Styles could satisfy a line "hit". If you do, you might find your lines incorrectly marked.

  • Keeping in mind that every line is checked against the existing Line Styles, consider defining a REJECT Line Style that will "hit" lines you do NOT want (i.e., blank lines). If these lines appear frequently in your file, consider moving the REJECT line up near the top of your Line Style order. Doing this saves a considerable amount of comparison time.

Defining Line Styles

Each line in your text or report file that contains information to be extracted must be defined with a line style. In addition, other lines may need to be defined as reference lines. Each line style consists of a recognition rule and a line action.

A line style definition is what the Data Extractor uses to identify particular lines of information within your report. For each set of information that you want to assemble into a record of data there may be one or more line styles. The number of line styles needed is determined by how many lines of text in the report contain that set of information. For example:

  • If the report contains header information that includes a date that you want to include as a field within each data record, define a line style for the header line that contains the date.
  • If a detail line in the report contains information that you want to include as fields within each data record, define a line style for the detail line.
  • If the date from the header line and the information from the detail line make one complete data record, define only those two line styles in the report.

In most cases, it is best if you do not define any lines of text from which you are not extracting data. This keeps the script file smaller and more efficient. Lines for which there is no line style defined are ignored. The exception to this rule is that in some cases performance can be improved by defining certain large repetitive sections as Skip lines.

Note: If your source file contains tabs and you want to use the Auto Line Style feature, you must first update your tab expansion setting. To do this, select Source > Options from the menu. Then, select the Printer Emulation tab and change the Tab Expansion setting to 0.

Recognition Rules

The recognition rule portion of a line style definition contains criteria by which the Data Extractor identifies a line of text. In other words, you tell the Data Extractor how to recognize a line or lines of the report by defining a set of criteria. After you define a line style in one section of your report, the Data Extractor compares each line of text in your entire report file with that recognition rule. For each line of text that matches the recognition rule, the Data Extractor assigns that line style. The line style name displays in the Line Style column to the left of the Data Panel for each matching line of text in the report.

The trick to defining a good recognition rule is to make it specific enough to NOT include any lines you do NOT want recognized, and broad enough to include ALL the lines you DO want recognized.

You may define line styles manually or let the Data Extractor automatically define them, depending upon whether or not your data can be handled by the Data Extractor's automatic features. For details about this option, see File Menu and Pop-up Menus.

Note: You may want to utilize the more advanced features after becoming familiar with the basic procedures. Tutorial 1 will help you get acquainted with the basics of the Data Extractor, and Tutorials 2 and 3 will introduce you to some of the time saving advantages of the Advanced features.

If you are defining a line style manually, the Data Extractor suggests a recognition rule based on a pattern that displays in the Line Style Definition window when it opens. If you have highlighted a particular portion of the line, this portion is automatically suggested to create the recognition rule. You may modify the suggested recognition rule before adding it to the Data Extractor database and script. Details about different ways to define line styles are found in this documentation.

The manner in which you define line styles depends on a number of factors, including your own personal approach to a task. The other major factor is the type of text or report file with which you are working. The sections below should help you determine the best approach.

Remember the selections in the Source Options window should be examined and possibly modified prior to defining line styles. For details about the available options, see Source Options Window.

Recognized by

When you define a line style, you are specifying a recognition rule that the Data Extractor uses to identify any line of text in your file that matches that rule. Recognition rules are built in the Line Style Definition window. Each line style recognition rule is based on one of seven basic recognition styles. The available styles are described below.

Each style consists of an expression that specifies a search criterion. To see all the available options for building recognition rules, see Line Style Definition Window.

The following are some common, but brief, examples of where and how to use each of the seven basic Recognition styles:

Pattern

If you see that there is a unique string of text on one type of line that does not appear on any other lines, but always appears on that type of line, highlight that text and it becomes the pattern that the Data Extractor uses to identify that line in each record. In some cases, the unique string of text may even be a single character in a consistent position.

This is the most common style to use for a recognition rule. It offers the most flexibility, but can also be the most difficult to define. Patterns are defined by either single-row or multiple-row expressions. Recognition patterns are described in more detail later in this documentation.

Relative Position

If the line of text you are defining appears in the same relative position to some other line, a Base Line, you might define the recognition rule by its relative position to that other line. The Relative Position option allows you to specify one or more lines above or below the Base Line. The line of text you want to use as a Base Line must be defined before you define a line style that refers to it.

Exact Line Number

If the line of text you are defining appears on only one line of text in the report, you might define the recognition rule by its Exact Line Number. This option is very rarely used because it can only be used to recognize lines that occur only once in a report.

Blank Line

If you need to define the lines of the report that do not contain any text, you can define the recognition rule for them by Blank Line.

It is not necessary to define the blank lines in your report except when you need to use them as Base Lines, Accept lines, or markers of some kind.

All Undefined Lines

After you have defined all the necessary lines of text in your report, you may want to use this recognition rule to define all of the remaining undefined lines. Alternately, if all but a few of the lines of your report need to be recognized in the same way, you may want to define the lines you do not want, and then use All Undefined Lines to define the lines that you do want. Another reason you might want to define all undefined lines is for use in the Debug Extract Design Window.

The default line action for this option is COLLECT Fields, but you may select any Action that fits your needs.

Note: It is not necessary to define the undefined lines.

Pattern & Relative Position

Select this option when you want the Data Extractor to define a line style based on its relative position to another line AND by some specific characters or types of characters. Before selecting this option you must have already defined another line to use as your Base Line.

The recognition rule options for Pattern and for Relative Position, as well as special options for using both, are described in detail in this documentation.

The default line action for this option is COLLECT Fields, but you may select any Action that fits your needs.

Non-Blank Line

Select this option when you want the Data Extractor to use this line style to define all lines in the report that contain anything other than spaces and an end of line character.

There is no recognition rule for this option. Its behavior is automatic.

More About Line Styles

Once you have defined a line style and added it to the script, the line style name displays in the Line Style Column to the left of all lines that match that recognition rule.

If a line style name appears on any line that should not be included, select Edit Line Style and make the recognition rule more specific so the unwanted lines of text do not meet the recognition rule's criteria.

If a line style name does not appear on any line that should be included, select Edit Line Style and make the recognition rule broader so the needed lines of text match the recognition rule's criteria.

If the name of the line style you just defined does not appear on any line at all in the Line Style Column, then the recognition rule does not match any line in the report. Place the mouse in the Line Style Column of any blank line and double-click, or select Line > Edit Line Style. When the Line Style Definition window opens, select the line style name you want to modify from the Line Name list box, and edit the recognition rule so the expected lines of text meet the recognition rule's criteria.

Modify a Recognition Rule

If you need to modify the recognition rule of a line style, highlight any part of a line. Then select Line Style Column4 > Edit Line Style, or place the mouse pointer on a line style name in the Line Style Column and double-click. When the Line Style Definition window opens, modify the recognition rule and Update it.

How the Data Extractor Builds Recognition Patterns

The following sections describe how the Data Extractor builds recognition patterns.

New Line Style

When you select New Line Style from the menu, the Data Extractor creates a suggested line style recognition pattern that displays when the Line Style Definition window opens. For details, see Recognition Rules. If you highlighted a piece of the data on the line before making the selection, the Data Extractor uses the data you highlighted to form the recognition rule.

If the pattern meets your needs and the default line style name is acceptable, select a line action and then click Add to accept the recognition pattern. See Line Action. The Line Style Definition window closes unless you have turned Close Definition Dialogs on Add/Update OFF in the Preferences menu.

If you need to change the pattern or want to change the line style name, you may enter a new name or modify the pattern, then select a line action, and click Add. The Line Style Definition window closes unless you have turned the Close Definition Dialogs on Add/Update OFF in the Preferences menu.

Auto New Line Style

When you select Auto New Line Style from the menu, another menu prompts you to select a line action. See Line Action. After you select a line action, the Data Extractor creates a line style recognition pattern automatically.

The default line style name displays in the Line Style Column for each line of text in your report file that matches the recognition pattern.

If you need to modify the line style recognition pattern or the line style name, double-click the line style name in the Line Style Column to open the Line Style Definition window and edit the line style recognition pattern and/or name. After making the desired modifications, click Update. The Line Style Definition window closes unless you have turned Close Definition Dialogs on Add/Update OFF in the Preferences menu.

Suggested Approach - Defining Line Styles

After opening the report in the Data Extractor, scroll through your report file and choose a section of the file with which to work. Identify a section that is most representative of the entire file. Then open the Source Options window to specify the sample of the file with which you want to work. For details about how to set the sample size, see Source Options Window. You can move from one section of the report to another as you go along, if your report does not have any one section with all of the data you need to define.

Examine your text or report file before starting to define line styles to determine if some of the default options need to be changed in the Source Options window. For details about the available options, see Source Options Window. After making the necessary selections in the Source Options window, follow the steps below.

To define a line style:
  1. Highlight some selected text in the Data Panel that can be used as a line style recognition pattern.
  2. Right-click with the mouse positioned anywhere in the Data Panel (the large white area of the window).
  3. Select Define Line Style4New Line Style.
  4. From the Line Style Definition window, change the default line style name, if desired.
  5. Notice that the Data Extractor entered a suggested line style recognition pattern based on what you highlighted. Check to see if that pattern is acceptable.
  6. Click the Line Action tab and select a line action if you need an action other than COLLECT Fields.
  7. If everything is acceptable, click Add.
  8. The Line Style Definition window closes unless you have turned Close Definition Dialogs on Add/Update OFF in the Preferences menu.
  9. When the Line Style Definition window closes, scroll through the report and verify that all of the matching lines in each section of the report now have been recognized with this line style.
  10. Edit the line style definition, if needed.
  11. Repeat the process until you have defined all the lines of text that contain the information you want to assemble into one data record.

In most cases, it is best if you do not define any lines of text from which you are not extracting data.

Tip: If you know that the pattern you have highlighted will work for this line style, select Auto New Line Style from the second level pop-up menu rather than simply New Line Style. This will cause a third level pop-up menu to appear with a list of line actions. Choose the line action you need for this type of line. The line style will be created without even opening the Line Style Definition window. You will see it appear in the Line Style Column next to every line with the highlighted pattern. For more information, see Understanding Line Style Behavior below.

Understanding Line Style Behavior

Each of the following sections describes methods you may use to highlight various types of data in your text or report file, and how the Data Extractor creates line style recognition rules based on what you highlighted.

When the Data Extractor searches a line of text to build a recognition rule, by default, it is based on a pattern. The Data Extractor searches for the following in the order listed here.

  1. Exact highlighted text, if less than entire line
  2. If entire line is highlighted:
    1. Field Tags
    2. Special Characters
    3. The First Field on the line of text

These are mutually exclusive, meaning if not field tags, then special characters; if not special characters, then first field, etc. The Data Extractor does not use field tags AND special characters AND first field.

Exact Highlighted Text

If you use the mouse to highlight one or more characters on a line of text, and select New Line Style or Auto New Line Style from the menu, the Data Extractor uses the highlighted text as the line style recognition pattern. This is usually a good way to begin defining a line style. Additions or modifications can be made in the Line Style Definition window. See Line Style Definition Window.

Highlight Selected Text in the Data Panel - Single Line

Highlight some selected text in the Data Panel, anything less than the total line. Zero in on the exact text, even down to one character, if necessary, that you want the Data Extractor to use as the recognition pattern. Select Define Line Style > Auto New Line Style, and then select a line action.

The Data Extractor uses the exact highlighted text as the recognition pattern for that line style and then assigns a unique line style name.

With the Append Line Pattern option at the second level pop-up menu, you can easily add to the recognition pattern. Simply highlight another piece of text that you want added to the recognition rules for that line and select Append Line Pattern from the menu. For details about Append Line Pattern, see Pop-up Menus.

Highlight an Entire Line of Text - Single Line

Highlight an entire line of text by clicking in the yellow Line Style Column to the left of the line. Select Define Line Style > Auto New Line Style, and then select a line action. The Data Extractor searches the entire line for:

  • A field tag, based on which tag separator was selected in the Source Options window.
  • Any of the special characters listed above.
  • The first field in the line of text, and uses whichever it finds first as the recognition pattern for that line style.

Field Tags

Many reports contain field tags, and the Data Extractor was developed to make use of these tags. Field tags may be used as line style recognition patterns or as field names for data fields.

Field tags are usually descriptive words that identify the data that follows the tag. In report files, there is usually a character that separates the field tag from its data, such as a colon or a hyphen. You may specify the tag separator on a line-by-line basis or for the whole report file in the Source Options Window.

A valid field tag must contain three words or less to the left of the tag separator. The Data Extractor finds field tags by searching everything to the left of the specified tag separator until it finds two spaces or the left margin of the report. When identifying field tags, the Data Extractor always uses a space to distinguish one word from the next.

Some examples of how the word "Name" may be used as a field tag:

  • Name:John M. Smith (a colon is the separator)
  • Name-John M. Smith (a hyphen/dash is the separator)
  • Name John M. Smith (2+ spaces is the separator)

Some examples of multi-word field tags with a colon and a space as the separator:

  • Name of Business: ABCD Corporation
  • First Name: Mary

Special Characters

For the purpose of extracting data out of text or report files, there are different types of printable information. For easier identification here, they are categorized into three groups, as follows:

Letters (A-Z and a-z)
Usually found in information such as a person's name, an address, a description of an item, etc.
Digits (0 - 9)
May be found in addresses, zip codes, phone numbers, currency amounts, dates, etc.
Special characters
Found in certain types of information that may be further categorized as specific formats of some kind, such as a date, a currency amount, or other specialized types of data. For this reason, the Data Extractor looks for these special characters when analyzing a line of text and tries to use them to build a recognition pattern.
These are the special characters that the Data Extractor searches for and where they might be found:
  • Period ( . ) - commonly found in numbers containing decimal places
  • Forward Slash ( / ) - commonly found in dates
  • Dash ( - ) - commonly found in dates, zip codes, and telephone numbers, but may also be a tag separator. For details on tag separators, see Source Options Window.
  • Colon ( : ) - commonly found in time data, but often is used as a tag separator. For details on tag separators, see Source Options Window.

First Field

First field is the first section of data on the line of text before the first fields separator (a space by default, but this can be reset in the Source Options window). More detailed information about how the Data Extractor automatically defines line styles can be found in the "How the Data Extractor Builds Recognition Patterns" section below. An explanation of each of the above search methods is described here.

Highlight Selected Text in the Data Panel - Multiple Lines

Highlighting a block of text in the Data Panel that includes more than one line of text is typically done with columnar reports, for detail lines within a report, or for lists of tagged data as in the TUTOR1.REP. Examples of each can be found in the Data Extractor tutorials.

Other Types of Lines

As you define the line styles in your report, you may find there are some lines of text that require special handling. This section examples those special kinds of lines and how you might define them with the Data Extractor. These include the following:

  • Blank Lines
  • Header Lines
  • Footer Lines

Blank Lines

The Data Extractor automatically and invisibly, defines all blank lines in a text or report file and sets the line action to REJECT Line. No line style name displays in the Line Style Column. This keeps the Line Style Column less cluttered and enhances the Data Extractor performance. By identifying and rejecting blank lines, the Data Extractor reads the lines that contain information more quickly.

If you need to use a blank line as a reference line or Accept line, define it appropriately.

Header Lines

When header lines are present in your report file, one or more of those lines might contain information you want to extract and assemble as part of a data record. For example, there may be a report title, report date, or account number that you want as a data field in each record in your target file. Treat these lines as you would any other ordinary line that contains data that you need in your output file. First, define a line style for any header line that contains data you want to extract. The line action is usually going to be COLLECT Fields. Then, in each of those lines, define each data field you want to extract.

Footer Lines

When footer lines are present in your report file, one or more of those lines might contain information that you want to extract and assemble as part of a data record. The Data Extractor can extract data out of footer lines that follow columnar data.

Define the line styles of any previous lines in the report. (Where you would normally select ACCEPT Record as the line action for the detail lines, select COLLECT Fields as the line action.) When you define the data fields within the detail lines, go to the Data Collection/Output tab in the Field Definition Window and turn Array Field ON for each field.

Proceed to the footer section and define the line or lines that contain the desired data. Select ACCEPT Record as the line action for the last defined line style. When you define the data fields on the footer lines, do not turn Array Field ON for these fields.

Line Style Names

When you are defining line styles, the Data Extractor assigns line style names as follows:

  1. If selected text is highlighted in the Data Panel, the Data Extractor uses the alpha characters of the highlighted text as the default line style name. If there are no alpha characters in the section of text that is highlighted, the Data Extractor names the line Style1, Style2, etc.
  2. If an entire line is highlighted and the Data Extractor finds a valid field tag on the line from which to build the recognition pattern, the default line style name is the same as the tag excluding the separator.
  3. If an entire line is highlighted and no field tag is found, the default line style name is created at design time as follows. Human navigation is completed to the desired web page (entering the requisite Passwords, etc.), and then a SaveAs to a local file is performed. That file can be brought in and processed using Data Extractor.

You may rename any line style in the Line Style Definition window, but it must meet requirements. Line style names must be unique and are limited to 20 characters. They may include upper and/or lower case letters (A - Z, a - z), numbers (0 - 9), and underscores ( _ ), but may not begin with a number. Line style names may not include spaces.

Line style names do not appear in the output file in any way. They exist in the Data Extractor only to help you distinguish between the different lines for which you have defined recognition rules and line actions.

Line Action

The second important part of a line style is the line action.

When defining a line style, you must select an Action that determines how the Data Extractor processes the lines of text or the data within the lines of text that matches the line style.

When you have made all the needed choices within the Line Style Definition Window, complete one of the following steps:

  1. Click Add if you are adding a new line style.
  2. Click Update after making modifications to an existing line style.

Defining Data Fields

Each data field that you define in the Data Extractor holds a particular piece of information that you want to extract from your report or text file. Each "set" of data fields that you define in the Data Extractor is exported as a record. A combination of the Line Style Definitions and Data Field Definitions determine how the data is extracted and assembled into data records.

This topic provides information about the various ways to define data fields using the Data Extractor.

Select the Define Data Field option after you have highlighted some text in the Data Panel and want to define one or more data fields. A second level menu appears. The information in each section below describes the menu options that are available in the second level menu.

If the data formats and/or techniques in this topic do not seem to fit your specific needs, please refer to Data Fields - Advanced Options.

Parse Columnar Data

Select the Parse Columnar Data option if you want the Data Extractor to analyze the highlighted text, define the line style automatically, and parse the line of text based on the selected column separator. For details about Column Separator options, see Source Options Window.

For example, if you have a line of text that looks something like the following, you might want the Data Extractor to parse it for you. First, select # of Spaces + as the column separator. Then select Parse Columnar Data from the menu.

09/15/95 Phone Bill Debit $120.45

The Data Extractor creates a line style that looks for the decimal point in the last column and the two slashes in the first column, and names it Phone Bill. It automatically parses this line into four data fields.

image\prscoldatfldex.gif

This option is only available if the highlighted line of text contains the column separator specified in Source Options between some text within the line. If there is no obvious pattern for the Data Extractor to automatically create a line style, a message appears letting you know that the Data Extractor cannot create the line style for you. You can then create a line style yourself, and use the Parse Columnar Data menu option to automatically parse the data fields on that line.

Parse Columnar w/Heading

Select this option if you want the Data Extractor to analyze a block of highlighted text, define the line style automatically, parse the lines based on the selected column separator, and allow you to specify the location of column headings whose names become the names of the data fields. For details about column separator options, see Source Options Window.

An example of columnar data with a heading:

Date Description TransType Amount
09/15/95 Phone Bill Debit $120.45

After this option is selected a dialog box appears asking how many lines the heading is, and how many lines to skip. The number of lines of data is automatically calculated from there. Click OK when the numbers are correct. This option is useful, even if you have several places in the report with this same pattern of columnar data. Only one section of data needs to be highlighted. Then all of the data with this pattern in the report is automatically defined.

Parse on Field Separator

Select this option if you want the Data Extractor to define fields in the highlighted area based on the column/field separator character.

Examples:

If the field separator is set to Space (the default value), Mark Jones Jr. would export with "Mark" in the first field, "Jones" in the second field and "Jr." in the third field.

If the field separator is set to Pipe ( | ), 123A|San Francisco|Mark Jones Jr. would export with "123A" in the first field, "San Francisco" in the second field, and "Mark Jones Jr." in the third field.

Note: Because fields parsed on field separators are generally not fixed length or fixed position, they do not change color on the Data Panel when defined.

For details about field separator options, see Source Options Window.

Parse Tagged Data

Choose this option if you want the Data Extractor to analyze the highlighted text and parse it based on the selected tag separator. For details about field separator options, see Source Options Window.

For example, if you have selected some text that looks something like the following, you might want the Data Extractor to parse it for you:

Telephone: 512-555-1212

The Data Extractor defines the line style using the text "Telephone:" in column 1 through 10 as the recognition pattern, assigns a line style name of "Telephone", defines one data field that contains "512-555-1212", and assigns a field name of "Telephone".

image\taggedex.gif

The Parse Tagged Data option is only available if the highlighted text contains the selected tag separator.

Parse HTML Tagged Data

Select this option if you want the Data Extractor to analyze the highlighted text and parse it based on the selected tag separator. For details about tag separator options, see Source Options Window.

HTML data is a platform independent text format commonly used on the Internet. Most people are familiar with its text basic layout. HTML files all have this basic building block, but vary widely in overall structure. The basic building block parsing can be done automatically in any HTML file.

Note: If Parse HTML Tagged Data is chosen with a selection over multiple lines, fields are parsed and defined only if the line was just added or there are no fields defined on the line. If Parse HTML Tagged Data is chosen with a selection in a single line, fields are always defined.

Auto New Data Field

Select this option to have the Data Extractor automatically and invisibly define a data field based on text that has been highlighted in the Data Panel. The Field Definition window does not open when this option is selected. The field name defaults to LINESTYLENAME_1, LINESTYLENAME_2, etc.

This option is only available if you have highlighted some text prior to right-clicking.

New Data Field

Select this option if you want to define a data field manually or view the work that the Data Extractor has done, based on the text you highlighted. It opens the Field Definition window where you may make selections and enter information to name or define the data field.

This option is available if you have highlighted either an entire line or some selected text within a line.

Edit Data Field

Choose this option if you want to view or modify a data field. It opens the Field Definition window where you may make selections and enter information to more specifically define the selected data field.

This option is only available if the cursor is positioned in a data field that has already been defined.

Delete Data Field

Choose this option to delete the selected data field definition. You will be prompted to confirm the deletion. The Delete Data Field option is only available if the cursor is positioned in a data field that has already been defined.

If the data formats and/or techniques in this topic do not seem to fit your specific needs, please refer to Data Fields - Advanced Options.

Data Fields - Advanced Options

The following sections describe how to define data fields when your report/text file contains more complex formatting than those described in the Defining Data Fields topic. To use the features described in this topic, select Define Data Field from the right-click menu in the Data Panel.


Tagged Data

Since tagged data occurs frequently in report files, we have included some special ways to handle it in the Data Extractor.

Tagged data can be considered any information in a report or text file where there is a field tag, or field name, followed by the data on one line of text. There is usually some kind of separator between the field tag and the data, such as a colon or a dash. In the Data Extractor, there is a list of tag separators from which to choose in the Source Options Window.

Tagged Data - Fixed Position

This section explains fixed-position tagged data where the field tag and its data occupy the same consistent horizontal position in a line from record to record within the report. Fixed position tagged data may appear in your text or report file in one of two ways:

  • Each field tag and its data may be on its own line (as in the Tutorial 1 file TUTOR1.REP).
  • There may be two or more field tags and their data on a line.

The procedures in this section are applicable to either of these two kinds of tagged data.

Note: It is not necessary to define the line style before using this procedure. The Data Extractor will use the field tag as a recognition pattern to create one automatically.

In the Data Panel, use the mouse to highlight a field tag, the field separator, and its data on a single line of text. Remember to extend the highlighted portion far enough to the right so wider data in subsequent records is picked up.

Example:

image\tagdata1.gif

With the mouse positioned anywhere in the Data Panel (the large white area), right-click. Select Define Data Field 4 Parse Tagged Data.

The Data Extractor first defines the line style, using the field tag as the recognition pattern and line style name. It then defines the data field by fixed position, starting with the column immediately after the selected tag separator and ending with the last highlighted column. The field tag is also used as the name for this data field.

Use the procedures above to define each piece of tagged data in one record. The matching lines and data fields in every other record of your report should automatically be defined. If they are not, you should edit the line style and/or the data field definition.

For a good example of how to work with tagged data using automatic features, see Data Extractor Tutorial 2 - Tagged Data and Automatic Features.

Tagged List Data - Fixed Position

This section deals with various report data formats that can be described as tagged list data. A tagged list consists of two or more lines of text that contain tagged data in a fixed position with one field tag and data field per line.

Tagged data can be considered any information in a report or text file where there is a field tag or field name followed by the data on one line of text. There is usually some kind of separator between the field tag and the data, such as a colon, a dash, or spaces. In the Data Extractor, there is a list of tag separators from which to choose in the Source Options Window.

Tagged list data may appear in your text or report file in one of three basic ways:

Each is described below along with the procedure for defining it in the Data Extractor.

Variable Inner Margins / Fixed Outer Margins

This type of tagged list contains field tags and data where the field tags are left justified and the data is right justified within the text file.

The first step in defining this type of tagged list is to go to the Script Design Choices tab of the Source Options Window and select the correct tag separator. For this type of format select # of Spaces+ from the Tag Separator list box. Another list box, the Separator Spaces box, displays below the Tag Separator box. Select With Data from this list box to indicate that the spaces between the tag and the data should be collected with the data. This causes the Data Extractor to extend the width of the data fields out to the left to within two spaces of the end of the longest field tag.

Here is an example of this type of a tagged list:

image\parstag3.gif

Notice that an entire block of text was highlighted. This highlighted block of data consists of one record of data.

After the block of data is highlighted, right-click in the Data Panel and select Define Data Field > Parse Tagged Data from the menu.

The Data Extractor defines each line style using the field tag found on each line of text within the highlighted block of text. Each line style is given a unique line style name corresponding to the field tag. The Data Extractor also defines each data field on each line, and the field names also corresponds to the field tag.

Variable Outer Margins / Fixed Inner Margins

This type of tagged list contains field tags and data where the field tags are right justified and the data is left justified within the text file.

The first step in defining this type of tagged list is to go to the Source Options Window and select the correct tag separator. For the example below, select ColonSpace(: ) from the Tag Separator list box.

Here is an example of this type of tagged list:

image\parstag2.gif

Notice that an entire block of text is highlighted. This highlighted block of data consists of one record of data.

After the block of data is highlighted, right-click in the Data Panel and select Define Data Field > Parse Tagged Data.

The Data Extractor defines each line style using the field tag found on each line of text within the highlighted block of text. Each line style is given a unique line style name corresponding to the field tag. The Data Extractor also defines each data field on each line, and the field names will correspond to the field tag.

For a good example of this kind of data with instructions to parse it, see Data Extractor Tutorial 2 - Tagged Data and Automatic Features.

Fixed Left Margin (Field Tags and Data)

This type of tagged list contains field tags and data where both the field tags and the data are left justified within the text file.

The first step in defining this type of tagged list is to go to the Source Options Window and select the correct tag separator. For this type of format select # of Spaces + from the tag separator list box. Another list box, the Separator Spaces box, will display below the Tag Separator box. Select With Tag from this list box to indicate that the spaces between the tag and the data should be collected with the tag. This will cause the Data Extractor to extend the width of the field tags out to the right to within two spaces of the beginning of the data fields.

Here is an example of this type of tagged list:

image\parstag1.gif

Notice that an entire block of text was highlighted. This highlighted block of data consists of one record of data.

After the block of data is highlighted, right click in the Data Panel and select Define Data Field 4 Parse Tagged Data.

The Data Extractor defines each line style using the field tag found on each line of text within the highlighted block of text. Each line style is given a unique line style name corresponding to the field tag. The Data Extractor also defines each data field on each line, and that the field names correspond to the field tag.


Tagged Data - Floating Position

This section deals with tagged data where the field tag and its data occupy different horizontal positions in a line from one record to the next within the report. In other words the tag and data "float" within the line of text.

Tagged data can be considered any information in a report or text file where there is a field tag or field name followed by the data on one line of text. There is usually some kind of separator between the field tag and the data, such as a colon or a dash.

The first step in defining this type of tagged data is to go to the Source Options Window and select the correct tag separator. For the example below, select ColonSpace(: ) from the Tag Separator list box.

Floating tagged data usually appears in your text or report file with multiple fields per line. It is the variable length of the data in these fields that causes them to float.

image\TAGDATA2.gif

Notice how the person’s name in the second record is longer than the name in the first record. This makes the end of that field a variable, not fixed, position. The tag for the third data field determines where the second data field ends. Also, the variable size of the second data field forces the third data field (Status) into a different position within the line. Thus, it has a variable beginning. The second data field on the line has a floating end tag, and the third data field has a floating start tag.

To define the AcctNo field:

You can use the Parse Tagged Data option from the right-click menu for the AcctNo: field in this report. But use the following procedure for the "Name:" and "Status:" fields.

To define the Name field:

  1. Highlight the text Name: Margaret Jones. (Choosing the longer data makes the export field length close to the right length. You might still have to adjust it slightly on the last tab. See step 8.)
  2. Right-click in the Data Panel and select Define Data Field 4 Parse Tagged Data. This creates a data field that begins two spaces past the "Name:" tag and name the data field "Name". The field becomes colored in the window, but the color is not in exactly the right place. It overlaps the "Status:" field tag in some records.
  3. Double-click the colored field. The Field Definition window opens. The start rule is Fixed Column. This is fine since this field begins in the same place in each record.
  4. Click the End Rule tab and select the Floating Tag radio button.
  5. Type Status: (case sensitive), including the colon, in the box to the right where the cursor is blinking. Since this field is not fixed in length, you need to set a maximum export length for it on the Data Collection/Output tab in the Export FldLength box.
  6. Click Add. The field returns to black. The Data Extractor cannot color data fields that are not fixed length and fixed position. Any field with a floating tag start or end rule remains in black text in the Data Panel.
  7. Highlight the text "Status: Closed" with the mouse and continue out to the end of the line.
  8. Right-click in the Data Panel and select Define Data Field4 Parse Tagged Data. This automatically creates a data field with the correct name, end rule, and export field length. The start rule is not correct. The text of the field changes color in the window.
  9. Double-click the colored text.
  10. At the Start Rule tab, select the Floating Tag radio button.
  11. Type Status: (case sensitive), including the colon, in the box.
  12. Click Add. The field text returns to black now that it is defined with a floating tag start rule.

It is common for fields to have both a floating tag start rule and end rule.

Note: When you use floating tags to define a data field, the text for those fields do not change to any color except black in the Data Panel. Floating Tag data fields also do not underline in the Data Panel even if you have that option turned ON. To verify the Data Extractor is reading the field contents correctly, open the Data Record Browser Window. To edit the field, highlight a piece of the line, right click and choose Define Data Field > Edit Data Field.

Columnar Data

Since columnar data occurs frequently in report files, we have included special ways to handle various types of columnar data in the Data Extractor.

Columnar data can be considered any information in a report or text file where there are two or more columns of data. It looks like a spreadsheet without the grid lines. There is usually some kind of separator between the columns, such as one or more spaces or a tab. In the Data Extractor, there is a list of column separators in the Source Options Window from which to choose. A single column of data is discussed below in the "Other Types of Data" section.

Columnar data often has column headings above the actual data. The Data Extractor allows you to use the column headings automatically as field names for the data fields. See the two Highlight Columnar Data sections below.

Columnar Data - No Heading - Single Line

In the Data Panel, use the mouse to highlight across two or more columns of data within a single line of text.

Example:

image\colsingl.gif

With the mouse positioned anywhere in the Data Panel, right-click. Select Define Data Field 4 Parse Columnar Data. One of two things happens:

  • If the line of text was previously defined with a line style, the Data Extractor defines each column as a fixed position data field with fixed column start and end rules and leaves one or two spaces between each of the data fields. The number of spaces left between the data fields depends on which column separator is selected in the Source Options window. Field names default to LineStyleName_1, LineStyleName_2, etc.>/li>
  • If the line of text was not previously defined with a line style, the Data Extractor defines the line style, using either field tags, or special characters, or first field as the recognition pattern. The Data Extractor then defines each column as a fixed position data field with fixed column start and end rules and leave one or two spaces between each of the data fields. The number of spaces left between the data fields depends upon which Column Separator is selected in the Source Options window. Field names default to LineStyleName_1, LineStyleName_2, etc. If nothing is found with which to define the line style, no field tag, special character, or first field, the Data Extractor returns a message indicating this. You need to define a line style recognition pattern manually. Then you can follow these steps again to define the fields automatically.

Note: Columnar fields are automatically set to Flush Field Contents.


Columnar Data - No Heading - Multiple Lines

To highlight a specific block of the text, place the mouse pointer immediately to the left of the first character of data to be included in the highlighted selection. Click and drag to the right until the last character of data is included in the highlighted selection. All lines of text between the two corners are highlighted.

Example:

image\colwohdr.gif

In this example, please note that the upper left corner of the block starts immediately to the left of the word "SALES/MARKETING", so the blank space left of the "S" is not a part of the highlighted text. The lower right corner of the block ends immediately after an empty data field in the last column.

Procedure

  1. In the Data Panel, use the mouse to highlight across two or more columns and lines of data. Do not include other data that is not columnar.
  2. With the mouse positioned anywhere in the Data Panel, right-click and select Define Data Field4Parse Columnar Data. One of four things happens:
    1. If all the highlighted lines of text were previously defined with the same line style, (named Sales for instance) the Data Extractor defines each column on each line as a data field using Fixed Column as the field definition. Field names default to Sales_1, Sales_2, etc.
    2. If the highlighted lines of text were not previously defined with any line style, the Data Extractor tries to define the line style, using only special characters found in the same columns in every highlighted line as the recognition pattern. If a recognition pattern is built, a line style is defined with the first field on the first line used as the line style name. The Data Extractor then defines each column on each line as a data field using fixed column as the field definition. Field names default to SALESMARKETING_1, SALESMARKETING_2, etc.
    3. If the highlighted lines of text were not previously defined with any line style, the Data Extractor tries to define the line style, using only special characters found in the same columns in every highlighted line as the recognition pattern. If no special characters are found, the Data Extractor cannot build a recognition pattern. A message box appears informing you that the Data Extractor "Couldn't add line definition, no recognition pattern found". If you get this message, must manually define a line style that recognizes all the detail lines. For details, see Defining Line Styles. After the line style has been defined, you can use the Parse Columnar Data option to define the fields.
    4. If the highlighted lines of text are defined with different line styles on various lines of the highlighted text, attempt to define every line with the same line style before selecting the Parse Columnar Data option. Otherwise, the Data Extractor creates a new line style (if it can) and adds the data fields to it. In this case, you must use the ReOrder Line Styles option to position the Data Extractor-created line style ahead of the previously defined line styles.

Note: Columnar fields are automatically set to Flush Field Contents.


Columnar Data - With Heading - Single Line

To highlight a specific block of the text, place the mouse pointer immediately to the left of the first character of data to be included in the highlighted selection. Click and drag to the right until the last character of data is included in the highlighted selection. All lines of text between the two corners are highlighted. image\colhsing.gif

Procedure

  1. In the Data Panel, highlight across two or more columns of data within a single line of text, and also include any column headings that may appear above the data.
  2. With the mouse positioned anywhere in the Data Panel, right-click and select Define Data Field 4Parse Columnar w/Heading.
  3. Type the number of lines occupied by the column heading and a number of lines for the Data Extractor to skip.
    1. If there is a line of dashes, asterisks, spaces, or other non-alpha characters immediately below the column heading, do not include this line when counting the Header Lines. Type the total number of heading lines in the box to the right of Header Lines.
    2. If there is other information or a blank line below the column heading that is not part of the column heading such as a line of dashes, asterisks, spaces or other non-alpha characters, count these lines, including blank lines and alpha lines, and type the total number in the box to the right of SKIP Lines. The Data Extractor calculates the number of data lines.
  4. Check to make sure it is correct, and then click OK in the Number of Header Lines dialog box. At this point, one of two things happens:
    1. If the line of text was previously defined with a line style, the Data Extractor defines each column as a fixed position data field with fixed column start and end rules. Each data field defaults with a field name similar to the column heading. For more details, see Field Names.
    2. If the line of columnar text was not previously defined with a line style, the Data Extractor defines the line style, using either special characters or first field as the recognition pattern, and then defines each column as a fixed position data field with fixed column start and end rules. Each data field defaults to a field name similar to the column heading. For more details, see Field Names.

Columnar Data - With Heading - Multiple Lines

To highlight a specific block of the text, including headers, place the mouse pointer immediately to the left of the first character of a header to be included in the highlighted selection. Click and drag the down and to the right until the last character of data in a detail line is included in the highlighted selection. It is not necessary to highlight all of the detail lines that will be parsed, but you want to be sure to highlight far enough to the right so data in the last column is not truncated in lines you may not be able to see in the display area. All lines of text between the two corners are highlighted.

Example:

image\colwhdr.gif

Procedure

  1. In the Data Panel, highlight across two or more columns and lines of data, starting with a column heading in the upper left corner and ending with a detail line in the lower right.
  2. With the mouse positioned anywhere in the Data Panel, right-click and select Define Data Field > Parse Columnar w/Heading.
  3. Type the number of lines occupied by the column heading and a number of lines for the Data Extractor to skip:
    1. If there is a line of dashes, asterisks, or other non-alpha characters immediately below the column heading, do not include this line when counting the header lines. Type the total number of heading lines in the box to the right of Header Lines.
    2. If there is other information below the column heading that is not part of the data or the column heading such as a line of dashes, asterisks, or other non-alpha characters, include those when counting the skip lines, count these lines, including blank lines and alpha lines, and type the total number in the box to the right of SKIP Lines.
  4. The Data Extractor calculates the number of data lines. Check to see that it is correct. Click OK in the Number of Header Lines dialog box. At this point, one of four things happens:
    1. If all the highlighted lines of text were previously defined with the same line style, the Data Extractor defines each column as a fixed position data field with fixed column start and end rules, and on each of the highlighted lines of text. Each data field defaults to a field name similar to the column heading. For more details, see Field Names.
    2. If the highlighted lines of text were not previously defined with any line style, the Data Extractor tries to define the line style, using only special characters found in the same columns in every highlighted line as the recognition pattern. If a recognition pattern is built, the Data Extractor then defines each column as a fixed position data field with fixed column start and end rules on each of the highlighted lines and any other lines in the report that meet the line style recognition rules. Each data field defaults to a field name similar to the column heading. For more details, see Field Names.
    3. If the highlighted lines of text were not previously defined with any line style, the Data Extractor tries to define the line style, using only special characters found in the same columns in every highlighted line as the recognition pattern. If no special characters are found, the Data Extractor cannot build a recognition pattern. A message box appears informing you that the Data Extractor could not find anything by which to define the line styles. If you get this message, manually define the line styles. For details, see Defining Line Styles.
    4. If the highlighted lines of text are defined with different line styles on various lines of text, try to define every line with the same line style before selecting the Parse Columnar Data option. Otherwise, the Data Extractor creates a new line style, if it can, and adds the data fields to it. In this case, you must use the Re-Order Line Styles option to position the Data Extractor-created line style ahead of the previously defined line styles or delete the old line styles.

For a good example of how to work with columnar data with the automatic features, see Data Extractor Tutorial 5 - Columnar Data with a Footer.


Other Data Formats

If the data in your report or text file is not formatted as tagged or columnar data, or if you need to extract data out of header and/or footer lines, read the information in this section.

Selected Text - Single Line - With Continuation

For reports that contain data in a paragraph or other long multi-line format, it is necessary to define a continuation rule. This causes all the data to be gathered in a single data field.

The steps in this section are based on the assumption that you have already defined the line style for the line of text on which you now want to define data fields. For information on how to define line styles, see Defining Line Styles.

Define a line style that recognizes the first line of the data only. Highlight the data just on that one line.

image\tutor6example.gif

Procedure

  1. Highlight the data on the first line, the one youdefined with a line style.
  2. With the mouse positioned anywhere in the Data Panel, right-click and select Define Data Field4New Data Field.
  3. Give the data field a meaningful field name, if desired.
  4. Fixed Column is the default selection at the Start Rule and End Rule tab. Change these values, if necessary.
  5. Click the Continuation tab, and select the continuation rule that is appropriate to your report format. Until Next Line Style is probably the most appropriate for the example above since the Remarks field could continue an indefinite number of lines and it is not known if the Unit Price line style will always be the line style following it.
  6. Since this field will not be fixed length, on the Data Collection/Output tab, set the Output FldLength for export purposes.
  7. Click Add. The Field Definition window closes unless you have turned Close Definition Dialogs on Add/Update OFF in the Preferences menu.

For details about each Continuation option, see Field Definition Window.

The field is not fixed length. The first line is fixed length and position and is colored and underlined. The rest of the field is not.


Header Lines

When header lines of text are present in your report file, some of them may contain information you want to extract and define as a data field in each record. For example, there may be a report title or report date that you want as a data field in each record in your target file.

You should first define a line style for any header line that contains data you want to extract. The COLLECT Fields line action is usually what is needed. Then in each of those lines, define each data field you want to extract. Continue defining the rest of the lines and fields.

You might also want to reposition the data fields prior to exporting the data. Open the Export Field Order Window by selecting Field > Export Field Layout and moving the data fields to the desired position.


Footer Lines

When footer lines of text are present in your report file, some of them may contain information you want to extract and define as a data field in each record.

If your data does not contain detail lines that would normally be ACCEPT Record lines, make the footer the ACCEPT Record line and proceed normally.

If the previous lines are detail lines, select COLLECT Fields as the line action rather than ACCEPT Record, as usual. For each data field within the detail lines, go to the Data Collection tab in the Field Definition Window and turn Array Field ON for each field.

Proceed to the footer section and define the line or lines that contain the desired data. Select ACCEPT Record as the line action for the last defined line style. When you define the data fields on the footer lines, do not turn Array Field ON for these fields. For an example of arrayed fields with a footer line, see Data Extractor Tutorial 5 - Columnar Data with a Footer.

You may want to reposition the data fields prior to exporting the data. Open the Export Field Order Window and move the data fields to the desired position.

Internal Data Browser

The Internal Data Browser allows you to view the extracted data in a tabular (row and column) format. This helps you know if your extraction rules are working as expected, and gives you an idea of how the data will look when exported to a CSV text file.

image\srcbrzic.gif

To open the Internal Data Browser, click the Browse Data Record button in the Tool Bar, or select Browse Records from the Tools menu. You must have done at least ONE of EACH of the following prior to opening the Record Browser window:

  • Defined one line style
  • Defined one data field
  • Set one line style as the ACCEPT Record
  • Assigned at least one data field to that ACCEPT Line

When the Internal Data Browser window opens it will look something like this:

image\browse3mixed.gif

This window shows a preview of the results of the data extraction rules you have defined in your script. It is not an actual file yet. You must run the export in order to create an output file.

Navigating in the Data Browser

You may view the entire extracted contents of your source data file, not just the part that displays in the Data Panel, by using the vertical and horizontal scroll bars to the right and below the data display. The entire window can be moved, sized, or maximized.

Tip: We suggest you do not use PAGE UP, PAGE DOWN, CTRL+END, or CTRL+HOME to move around in the Internal Data Browser. Due to the potential size of the data, it would be best if you use the arrow keys on your keyboard, or click the arrows in the horizontal and vertical scroll bars, thus moving only one column or record at a time. Also note that scrolling down is generally faster than scrolling up in the Browser.

The name of your source report/text file, including drive and directory path, is displayed above the data display in the window's title bar.

The field names you assigned in the Field Definition window will display as column headings above each column of data. If you did not assign field names to your data fields, the column headings will default to LineStyleName_1, LineStyleName_2, etc. The columns can be sized by dragging the right margin of the column in the header box to the right or left.

Here are descriptions of some of the other features of the Internal Data Browser:

Rec No.

To the left of each record (row) is a record number. This is not part of the actual data, but is a method of identifying records. If the cursor is positioned within a record (row) in the browser, the record (row) number will display in this box at the bottom of the window.

Display

This box allows you to see your extracted data in several different formats. Formatted Text, Masked Text, Hex values, and Hex/Text mix, which shows printable characters normally and hex values of non-printing characters. For more information on hexadecimal character representation, see Hex and Decimal Values.

To close the Internal Data Browser, click the X in the upper right hand corner.

Single Record Type

If you have defined only one ACCEPT Record, each row in the Browser will contain one data record that the Data Extractor has extracted and assembled from your report or text file, based on the line style and data field definitions you have created. This is where you can examine the records to determine if you have defined the line styles and data fields correctly to achieve the results you desire. This is also where you can view the order in which the data fields will be exported to your output data file. If you would like to change the field order prior to exporting the data, close the browser and choose Export Field Layout from the Field menu.

Multiple Record Types

If you have defined more than one ACCEPT Record, a multi-record type browser will appear. This browser will show the data differently, with the first column indicating which record type is displayed. The ACCEPT Record line style name will be used to label the various record layouts. The data field name and the contents of the fields will appear in subsequent columns.

Exporting the Data

After successfully defining the data extraction rules in the Data Extractor, the next step is to export the data to a file on your workstation or any network drive/directory to which you have access. This help topic describes the export procedures and options.

The Data Extractor offers two export file types, as follows:

  • ASCII (Delimited) - also called CSV Text
  • Unicode (Delimited) - CSV Text with support for international character sets, including double-byte, multibyte, UTF-8, UTF-16, and many more.

Export Procedure and Options

The data export features are available in the Export window. To open the Export window click on the Run Extractor button in the button bar. The following steps and options are found on five tabs in this window.

Connect Info

On the Connect Info tab, select the output file format and the drive, directory, and filename of the output data file.

  1. In the Export File Type picklist, select the desired format of the output file.
  2. In the Target File/URI box, type or navigate to the drive and directory path to which you want the output file to be written, and enter a filename.
  3. Click the Connect button to the right of the Target File/URI box.

Note: The Run Export button at the bottom of this window will remain grayed out until you have entered and connected to the Target File/URI. However, you should (at a minimum) go to the Properties and Records tabs and make any needed selections prior to exporting the data.


Sample Extraction (Optional)

On the Sample Extraction tab, you may specify a subset of the extracted data records to be exported. This is convenient if you need a random sample of the records, or in cases where you simply want to export a few records for testing. The default selection is All Records. To select one of the other options, click the radio button to the left of the desired option. The options are:

A Range of Records Select this option when you want to specify the starting and ending record numbers to export. Then enter the desired starting and ending record numbers in their respective boxes.
Every Nth Record Select this option when you want to select a random sample of records to export. Then enter the desired value in the "N" box.

Filter Extraction (Optional)

On the Filter Extraction tab, you may specify a subset of the extracted data records to be exported, based on record-filtering criteria. This is convenient if you need to export only the data records that contain specific information in a data field. The options are:

Source Field From the picklist, select the source data field that contains the information on which you want to filter the data records.
Operator From the picklist, select the operator for your filter. The options are: < (less than), <= (less than or equal to), = (equal to), <> (not equal to), >= (greater than or equal to), and > (greater than).
Value In this box, enter the information on which you want to filter the data records. Remember, that the selected operator may determine exactly what you enter here.

Properties

The options on the Properties tab are properties that you may need to specify for your output data file. After you have selected all the correct choices from the list of options below, click the Apply button.

Delimited ASCII data has special characters between fields and records. Another name for delimited ASCII files is CSV (comma separated variable) text.

When delimited ASCII is your export file format, the default field separators are commas (,) and the field delimiters are quotation marks ("). The default record separator is a carriage return-line feed (CR-LF). If different separators and/or delimiters are needed, you can specify them by selecting from the options in the Properties tab and choosing the desired separators and delimiters for your target file. When you are creating a delimited ASCII file that will be imported into another application, you should set each of the Properties options as required by application before exporting the data.

CodePage

The selected CodePage Translation Table determines which code table to use when writing out the target data. The default is ANSI, which is the standard in the US. You may select any of the other code page options from the picklist.

RecordSeparator

A delimited ASCII file is presumed to have a carriage return-line feed (CR-LF) between records. To specify a different a record separator for your output data file, click in the RecordSeparator cell and click once. Then click the down arrow to the right of the box and click the desired record separator in the list box. The list box choices are: carriage return-line feed (default), line feed, carriage return, line feed-carriage return, form feed, empty line, ctrl-E and no record separator. If you have or need an alternate record separator other than one from the list box, you can type it here.

  • If the record separator is not one of the choices from the list box and is a printable character, highlight the CR-LF and then type the correct character. For example, if the separator is a pipe ( | ), type a pipe from the keyboard.
  • If the record separator is not one of the choices from the list box and is not a printable character, highlight the CR-LF and then enter a backslash ( \ ), an 'X' and then the hex value for the separator. For example, if the separator is a check mark, type \XFB. For a list of the 256 standard and extended ASCII characters, see Hex Values Reference Chart in the Data Parser User's Guide.
FieldSeparator

Delimited ASCII files are presumed to have a comma between each field. To specify a different field separator, click in the FieldSeparator Current Value box and click once. Then click the down arrow to the right of the box to display the list of options. The list box options are: comma (default), tab, space, carriage return-line feed, line feed, carriage return, line feed-carriage return, ctrl-R, a pipe (|) and no field separator. If you need an alternate field separator other than one from the list box, you can type it here.

  • If the field separator is not one of the choices from the list box and is a printable character, highlight the CR-LF and then type the correct character. For example, if the separator is an asterisk (*), type an asterisk from the keyboard.
  • If the field separator is not one of the choices from the list box and is not a printable character, highlight the CR-LF and then enter a backslash (\), an 'X' and the hex value for the separator. For example, if the separator is a check mark type \XFB.
FieldStartDelimiter

Delimited ASCII files are presumed to have start-of-field and end-of-field delimiters. The default delimiter is a quotation mark ( " ) because it is the most common. However, some files do not contain field delimiters, so this option is available for you to choose. To create an output data file with no delimiters, click in the FieldStartDelimiter Current Value box and click once. Then click the down arrow to the right of the box and select None.

FieldEndDelimiter

Delimited ASCII files are presumed to have beginning-of-field and end-of-field delimiters. The default delimiter is a quote ( " ) because it is the most common. However, some files do not contain field delimiters, so this option is available for you to choose, if needed. To create an output file with no delimiters, click in the FieldEndDelimiter Current Value box and click once. Then click the down arrow and select None.

Header

This option uses the field names you created in your Data Field Definitions to automatically create a header record in your output CSV data file. To create a header record, click in the Header Current Value box, click once and then click True. The default setting is False.

Note: If you are appending data to an existing file (Output Mode is Append), leave the setting for Header as False.

Field Delimit Style

This option determines whether the specified FieldStartDelimiter and FieldEndDelimiter is used for all fields, only fields containing a separator, or only text fields, as follows:

All Places the delimiters specified in FieldStartDelimiter and FieldEndDelimiter before and after every field. "All" is the default setting. For example: "Smith","12345","Houston"
Partial Places the specified delimiters before and after fields only where necessary. A field that contains a character that is the same as the field separator would have the field delimiters placed around it. A common occurrence of this is where there is a "memo" field that contains quotes within the text of data. For example: "Customer responded with "No thank you" to my offer".
Text Places delimiters before and after Text and Name fields (non-numeric fields). Numeric and Date fields have no FieldStartDelimiter or FieldEndDelimiter. For example: "Smith", 12345,"Houston", 11/13/04
NonNumeric Places delimiters before and after all non-numeric types, such as Date fields. An important difference between Non-Numeric and Text: Non-Numeric delimits Date fields, while Text does not delimit Date fields.
StripLeadingBlanks

For an ASCII Target file, leading blanks are stripped by default. If you want to leave the leading blanks, click in the StripLeadingBlanks Current Value box and click once. Then click the down arrow to the right of the box and click False.

StripTrailingBlanks

For an ASCII Target file, trailing blanks are stripped by default. If you want to leave the trailing blanks, click in the StripTrailingBlanks Current Value box and click once. Then click the down arrow to the right of the box and click False.

TransliterationIn

Allows you to specify a character, or a set of characters, to be filtered out of the Source data. For any character in Transliterate In, the corresponding character from the Transliterate Out property is substituted. If there is no corresponding character, the Source character is filtered out completely. Transliterate In supports C-style escape sequences such as: \n (new line), \r (carriage return) and \t (tab).

TransliterationOut

Allows you to specify a character to be substituted for another character from the Source data. For any character in Transliterate In, the corresponding character from the Transliterate Out property is substituted. If you wish the Source character to be filtered out completely, leave this field blank. If there are no characters to be transliterated, this field should be left blank. Transliterate Out supports C-style escape sequences such as: \n (new line), \r (carriage return) and \t (tab).

MaximumDataLen

This option allows you to specify the maximum number of characters to write to a field. If this value is set to 0 (the default), the number of characters that are written to a field are determined by the field length. If you set this value to any value OTHER THAN zero, data may be truncated.

NullIndicator

This option allows you to specify whether "blank" target data fields will contain an empty string (nothing) or a Null character (00). The default for most CSV text files is None.


Record

On the Record tab, the picklist contains the name(s) of one or more record types, depending on the number of ACCEPT records you defined when defining the data extraction rules, as follows:

  • If you defined only one line style as the ACCEPT record, the picklist will contain only one selection.
  • If you defined two or more ACCEPT records, the picklist will contain the names of each of the ACCEPT records. You may export only one type of record for each export, therefore you must perform multiple exports if you have defined multiple record types in your data extraction rules. See "Tip" below.

Tip: If your source report/text file contains parent and child records for which you need to retain the relationships, you should include a key field in each of the record types (ACCEPT record line styles) when specifying the fields to include in each. Using this methodology, your task of joining data across multiple files is possible.


Run Export Button

At the bottom of the Export window, click on the Run Export button after making all the desired selections in the tabs. Your output data file will be written to the drive, directory path, and filename you specified on the Connect Info tab.

How to Save an Extract Script

After you have defined your data extraction rules (line styles and data fields definitions), save your extract script for reuse.

Before saving an extract script, notice that new extract scripts show as "Extract: Extract1" on the title bar of the main window.

image\Savescr.gif

You might want to save your script for a variety of reasons, including:

  • You may have a complex report file that requires more than one session for script designing.
  • You might receive a similar report file in the future, in which case you could open this one and modify it, rather than starting from the beginning.

How to Save an Extract Script

To save an extract script, click Save Extract in the toolbar or select File > Save Extract in the menu. Both of the options display the Save Extract window.

Save your extract script in the same path as your report, and have it share the name of your report, with a CXL extension. This is consistent with the standard naming for Data Extractor scripts. The Data Extractor automatically saves your file with a .CXL extension.

The Author name is defaulted for you but can be changed.

Enter a description if desired and click OK. Your new extract script is saved.

Alternately, use the Save Extract As option to move an extract script file to a new location or save it as just a name, without a path.

How to ReUse a Saved Extract Script

You can open a previously-designed script with a different report or text file as follows:

  1. Open the saved extract by double clicking on it in the grid that displays your saved extracts when you first open the Data Extractor.
  2. Select Source > Options from the menu.
  3. Click the File Properties tab.
  4. Click the Text File down arrow and browse to choose the new report file.

Tip: If you modify the extract script to work with the new report/text file, you may want to use the Save As option to save the modified script with a new script filename, rather than overwriting the original script.

Tool Bar Buttons

There is a toolbar located just below the main menu toward the top of the Extract Schema Designer window. The buttons in the toolbar provide quick access to the most commonly used commands and windows.

Each of the buttons on the toolbar is described below:

Button Label Description
New Extract Selects a representative text file or HTTP address and create a new extract script file.
Open Extract Opens a script file and report file that already exist.
Close Extract Closes the current report and script file. If you have made changes to the open script, you will be prompted to save the changes.
Save Extract Saves the extraction script on which you are working. If you are creating a new script, a dialog box will open in which you may enter or change the name of the script file, the author name, and a description. If you are modifying an existing script, clicking on this button will overwrite the script. To save another script with a different script name, select Save As from the File menu.
Validate Verifies that at least one accept record is defined and all data fields are assigned to at least one accept record. If these conditions are met, a dialog box confirms "Validation Successful". Otherwise, a dialog box warns you of these errors in your script.
Source Options Opens a dialog box that gives you some information about the current report. This is also where you can make selections that affect the way the Extract Schema Designer works with your report. See Source Options Window.
Source Font Changes the screen font in the Extract Schema Designer window.
Underline Fields Turns underlining of data fields ON or OFF. When underlining is turned ON, a broken line will display in the Data Panel under each defined fixed position data field. To turn Underline Fields OFF, click this button again.
Delete All Line Styles Deletes all defined line styles in the current script.
Delete All Fields Deletes all defined data fields in the current script.
Clear All Accept Record Fields Clears All ACCEPT Records Fields. This allows you to completely scrap your entire export mapping and start over. When you choose this option, you will be asked if you want to Clear Fields in All ACCEPT Records. If you choose Yes, all of your fields become unassigned again (note the icon change in the Line Style column). If you choose No, you will return to the Extract Schema Designer window. For more details, see ACCEPT Record Definition Window.
Search Text Searches for some specific text string in your report. A dialog box in which you enter the search string will open. See Find Text.
Debug Extract Opens a dialog box that offers some methods of helping you determine how well your script is reading your text or report file. This allows you to verify that your line styles and data field definitions are either correct or need some additional work. For more details see Debug Extract Design Window.
External Source Data Viewer Displays your report in another application, the application it was created in, the application that Windows associates with files of this type, or any other application of your choice. You can set a default application to use for viewing all of your files in the Source Options window. See Source Options Window.
Browse Data Records Opens the browser after defining line styles and data fields. The Browser will display data from your report in a "flattened" column and row tabular format. This allows you to verify that your line style and data field definitions are either correct or need some additional work. For more details, see Record Browser Window.
Vertical Spacer Shows or hides a red vertical positioning bar in the Data Panel in the main window. The positioning bar is helpful in determining whether or not particular text characters start and/or stop in the same position throughout the report. After turning the Positioning Bar ON, click the mouse to position the bar in the Data Panel. To turn the Positioning Bar OFF, click this button again.
Toggle Space/Tab Symbol Shows or hides the space and tab symbols. Default symbols are a small gray dot to show where spaces exist in the source file and a double right pointing arrow for tabs. In Source > Options > Display Choices, you can choose different symbols if you wish. For details, see Source Options Window. Space and tab symbols can also be toggled on and off by selecting Preferences > Show Space Symbol from the menu.
Help Accesses the documentation contents screen.

Extract Script Manager Window

The Extract Script Manager is the first window the user sees after initiating a new session of Data Parser for Unstructured Text. This is where saved scripts are listed. However, when you start the Data Parser for the first time, the list will be empty.

To create a new extraction script, click the New Extract button in the toolbar or select File > New Extract. For details, see Open a Text File, Report File or URI.

After one or more scripts have been created and saved, the following information about each appears: Extract Script Name, File Name, Author, Description and Last Updated.

The columns in the Extract Script Manager can be sized larger or smaller so the user can view the entire contents of each field. To size a column, place the mouse pointer directly over the vertical line between two of the column headings. The pointer will become a bold cross with arrows pointing left and right. Press and hold down the left mouse button and drag the pointer in either direction until you have sized the column to the desired width. Then release the mouse button.

The horizontal scroll bar just below the table allows the user to scroll left and right within Extract Script Manager to view additional columns of information. You may also use the mouse pointer to highlight the script name, and then move left or right using the left or right arrow keys.

If you have more Extracts than will display on one screen, you can use the vertical scroll bar to view additional scripts.

To highlight an extraction script (for viewing, modifying, or deleting), place the mouse pointer just to the left of the extraction script number. The pointer will change to a right-pointing arrow. Click once to highlight the entire line. The extraction script name appears in the Extract box centered just below the Extract Manager table.

To open a saved extraction script and load the original report, highlight the extraction script as described above and double-click the extract name or select File > Open Extract.

To open a saved extraction script and load a different report, open the extraction script as described above. Then select Source > Options. Choose the File Properties tab. Click the down arrow next to text file, browse to select the new file and click Open. Click OK.

To delete a saved extraction script, highlight the extraction script as described above, then press Delete.

Pattern Builder Window

The Pattern Builder window is where you specify the exact criteria for which the Data Extractor is searching when it is determining whether or not a line of text in the report should be defined as a particular line style.

The Pattern Builder Window is accessed from the Look For? cell in the Line Style Definition Window when you click the down arrow.

Available options in the Pattern Builder window are:

Edit

The menu in the upper left corner allows you to paste text from the white pane of the data display into the Value field. The paste function is only available when the Type is literal.

Type

There are five types of criteria the Data Extractor looks for. The five options are literal, character class, negated character class, mask, and regular expression.

literal

Select this option to have the Data Extractor search for some specific string of text. The string will be entered in the Value column.

character class

Select this option to have the Data Extractor search for some classification of characters, digits, or other type of criteria. A character class may be any of the classes or types of information listed below in the Value section, or a user may enter any valid string to specify their own character class.

negated character class

Select this option to have the Data Extractor search for anything EXCEPT some classification of characters, digits, or other type of criteria. A character class may be any of the classes or types of information listed below in the Value section, or a user may enter any valid string to specify their own character class.

mask

Select this option if you need to enter a special expression to define the search criteria. There are three special symbols that you may use along with any printable character to build the mask. The special symbols are @, #, and *. Each is explained below.

Symbol Description
@ Use the @ (at sign) when you want the Data Extractor to search for any alpha character. For each @ entered in the mask, the Data Extractor searches for [A - Za - z].
# Use the # (pound sign) when you want the Data Extractor to search for any digit. For each # entered in the mask, the Data Extractor searches for [0 - 9].
* Use the * (asterisk) when you want the Data Extractor to search for any alphanumeric character or digit. For each * entered in the mask, the Data Extractor searches for [A - Za - z0 - 9].

Any printable ASCII character (except the three special characters shown above) and spaces can be used as a literal in combination with the three special characters to build the mask.

Some mask examples:

To set up the criteria for any Canadian postal code, enter a mask of @#@ #@#. This mask looks for [A - Za - z][0 - 9] [A - Za - z] (space) [0 - 9] [A - Za - z][0 - 9]

To set up the criteria for any social security number, enter a mask of ###-##-####. This mask looks for [0 - 9] [0 - 9] [0 - 9] - [0 - 9] [0 - 9] - [0 - 9] [0 - 9] [0 - 9] [0 - 9]

regular expression

Select this option if you want to use a language that allows you to specify a string of characters that defines a set of rules for matching character strings. A regular expression will match a field whether it matches the whole field or just a small sequence within the field.

The following are the special characters that can be used in Data Extractor regular expressions:

| (and) * + ? [and] - . \ ^ $

To use the literal value of a special character within a regular expression, you must precede the special character with a backslash ( \ ). For example, to enter a literal backslash, you must type it twice ( \\ ); to enter a literal dollar sign, you must type backslash and then dollar sign ( \$ ).

Example:

To set up the criteria for a four-column line position test, enter (0[0-9])(1[1-5]) OR 0[0-9]1[1-5]. This expression would find a line number with the first position of 0, the second position of 0-9, the third position of 1, and the fourth position of 1-5. Yielding up to 50 different hits.

Value

If you selected literal in the Type column, the value entered here is the exact string for which you want the Data Extractor to search. When entering the value, type only the exact string. This is case sensitive. The Data Extractor automatically encloses the value in quotation marks. Also see "Set Value = Selected Report Text".

If you highlighted a section of the text before opening the Line Style Definition window, that value is generally entered here automatically.

When you select character class, negated character class or regular expression in the Type column you must either make a selection from the list box in the Value column or enter your own string. Remember, negated character class means anything EXCEPT what you specify in the Value column.

The options in the Value column list box follow:

Option Description
any character Select this option to have the Data Extractor search for any printable or non-printable character.
digits Select this option to have the Data Extractor search for any of the digits zero (0) through nine (9).
digits/ Select this option to have the Data Extractor search for any of the digits zero (0) through nine (9) or a forward slash (/).
digits- Select this option to have the Data Extractor search for any of the digits zero (0) through nine (9) or a hyphen (-).
digits/- Select this option to have the Data Extractor search for any of the digits zero (0) through nine (9) or a forward slash (/) or a hyphen (-).
letters Select this option to have the Data Extractor search for any printable alphabetical character (a - z or A - Z).
upper case letters Select this option to have the Data Extractor search for any printable upper case alphabetical character (A - Z).
lower case letters Select this option to have the Data Extractor search for any printable lower case alphabetical character (a - z).
alphanumeric characters Select this option to have the Data Extractor search for any printable alphabetical character (a - z or A - Z) or any digit from zero (0) through nine (9).
letters and white space Select this option to have the Data Extractor search for any printable upper or lower case letter (A-Z or a-z) and white space.
upper case letters and white space Select this option to have the Data Extractor search for any printable upper case letter (A-Z or a-z) and white space.
lower case letters and white space Select this option to have the Data Extractor search for any printable lower case letter (A-Z or a-z) and white space.
white space Select this option to have the Data Extractor search for a Tab (hex 09 or ANSI 009) or some designated number of spaces.
space Select this option to have the Data Extractor search for a single space.
tab Select this option to have the Data Extractor search for a tab character (hex 09 or ANSI 009). Tab expansion must be set to 0 on the Printer Emulation tab in the Source Options window in order for the Data Extractor to detect Tab characters in the file.
carriage return Select this option to have the Data Extractor search for a carriage return (hex 0D or ANSI 013).

Count

The count determines how many of the specified string(s) the Data Extractor searches for.

It is important to note that Count searches for consecutive values. For example, if you search on the string MM/DD/YY using the following settings in the Pattern Builder, no matching results display:

  • Type=character count
  • Value=alphanumeric
  • Count=3

There is no match because MM/DD/YY does not contain 3 consecutive alphanumeric characters.

Here are the Count options:

Option Description
0 - many Select this option if you want the Data Extractor to search for "None" to "Many" of the specified string(s).
1 - many Select this option if you want the Data Extractor to search for "One" to "Many" of the specified string(s).
0 - 1 Select this option if you want the Data Extractor to search for "None" to "One" of the specified string(s).
1 thru 9 Select any of these options when you want the Data Extractor to search for a specific number of the specified string(s). If you want the Data Extractor to search for more than 9 of a specified value, type the number desired in this cell.

Begin

This is the column number in which the Data Extractor should start searching for the specified string. If you modify the string in the Value column, you may need to enter a new column number here. Values only appear in the Start cell when Column is selected in Search What?. For more information about the Begin option, see Begin (line or column).

End

This is the column number in which the Data Extractor should stop searching for the specific string. If you modify the string in the Value column, you may need to enter a new column number here. Values only appear in the End cell when Column is selected in Search What?. For more information about the End option, see End (line or column).

Extract Script Designer

Extract Script Designer is the working area where you will use the mouse pointer, shortcut menus, and dialog boxes to define specific parts of the report text that you want to convert to another format.

The main window may be sized, minimized, or maximized to suit your needs.

A bold Vertical Splitter Bar splits this window into two panes. The left pane is the Line Style Column. The right pane is the Data Panel. Details about this window are discussed below.

Line Style Column

This is the lightly tinted yellow pane to the left of the Vertical Splitter Bar. When you first open a new report file, the Line Style Column will be empty. As you define lines of text, the line style names will display in the Line Style Column. Along with the line style names, one or more symbols may also display in the Line Style Column to designate the line action and/or other information about a particular line in your report. See Line Action.

The main purpose of the Line Style Column is to display the line style name and line action symbols of each defined line of text. The line style names give you a visual way to verify that each line of text in your report file matches the appropriate line style recognition pattern. The line action icons give you a visual way to identify how the Data Parser for Unstructured Text is going to process the line of text and any data fields you have defined within that line.

The mouse behavior is slightly different in the Line Style Column as opposed to the Data Panel. If you click the mouse in the Line Style Column, the entire line will be selected. When you highlight some of the data and right click in the Line Style Column, a different shortcut menu will display from the one that would display if you highlighted some data and right clicked in the Data Panel. Details about the shortcut menus are described in Shortcut Menus. For details about how to define line styles, see Defining Line Styles.

After a line of text has been defined, you can double-click a line style name in the Line Style Column to open the Line Style Definition window for viewing and/or modifying the recognition rule. For details, see Line Style Definition Window.

There is a horizontal scroll bar at the bottom of the Line Style Column. If the width of the Line Style Column is less than the longest line style name, you may scroll to the right to view the portion of the line style name that does not display. You may also drag the Vertical Splitter Bar to the right to make the Line Style Column wider.

Data Panel

This is the large white pane to the right of the bold Vertical Splitter Bar. The Data Panel is the main work area where you highlight selected text that will be used to define line styles and/or data fields.

To indicate that you want to define a line style or data field, highlight a particular selection from a line of text and click the right mouse button to bring up a series of shortcut menus from which to work. For details, see Shortcut Menus.

As data fields are defined, the text that is included in the various fields will change from black to red, green, blue, or magenta if the fields are fixed length and fixed position. Where there is more than one data field on a line of text, the colors will alternate so you can distinguish one field from another on that line. The individual colors have no particular significance. You may also have the Data Parser for Unstructured Text display a colored, dashed line under each defined data field by turning Underline Fields ON in the Preferences menu. For details about other display options, see Source Options Window.

For details about defining lines of text, see Defining Line Styles.

Details about defining data fields are discussed in the Define Data Fields.

There is a horizontal scroll bar at the bottom of the Data Panel. If the width of the Data Panel is less than the longest line of text in your report file, you may scroll to the right to view the portion of the text that does not display. You may also adjust the width of the main Data Parser for Unstructured Text window by dragging its right border to the right, or by maximizing it.

There is a vertical scroll bar on the right side of the Data Panel when the report is longer than will display in the window. You may scroll up and down in the Data Panel to view the portions of the report that do not display. The Source Sample setting in Source Options will determine how much of your report file actually shows in the Data Panel. For details, see Source Options Window.

Vertical Splitter Bar

The bold vertical splitter bar between the Line Style Column and Data Panel may be dragged left and right to adjust the sizes of the two panes of the main window. Place the mouse pointer directly on the splitter bar. The pointer will change to a bold cross with left- and right-pointing arrows. Click and hold down the left mouse button and drag the splitter bar to the desired position. Then release the mouse button.

This bold black Vertical Splitter Bar between the two panes of the Script Designer window should not be confused with the thin red Vertical Positioning Bar that can be placed in the Data Panel by clicking the Vertical Positioning Bar button on the toolbar and then clicking in the Data Panel.

Cursor Position Boxes

In the lower left of this window, below the Line Style Column, there are two boxes that indicate at which text line and column the blinking cursor is positioned. A text or report file must be opened in the Data Parser for Unstructured Text for this to be active.

Field Name Indicator

When you move the mouse pointer over a defined data field (three exceptions discussed below) in the Data Panel, the mouse pointer will change to a hand. In the lower left of this window, below the Data Panel and Line Style Column, there is a Field Position Indicator that will display the name of the data field over which the mouse pointer is located.

There are three exceptions:

  • Data fields that are defined as Floating Tags
  • Data fields that are defined as Relative Word Position
  • Subsequent lines of a data field that continues across multiple lines of text

These fields are non-fixed length and non-fixed position, and therefore will not be colored on the Data Panel.

Mouse Position Box

In the lower right of this window, below the Data Panel, there is a box that indicates at which text line and column the mouse is positioned. The number to the left of the comma indicates the text line. The number to the right of the comma indicates the column. A text or report file must be opened in the Data Parser for Unstructured Text for this to be active. And when you move the mouse pointer beyond the End of Line marker, the mouse position boxes will remain blank.

Hex and Decimal Value Box

In the lower right corner of this window, below the Data Panel, there is a box that indicates the hex and decimal value for any character that the mouse pointer is positioned over. See for a table of values and their associated characters. This box will not display a value for end of line characters. See Decimal and Hexadecimal Values.

Source Options Window

The Source Options window opens each time you begin a new extract script, unless you go to the Display Choices tab and clear the Display Source Options with New Extract check box. Otherwise, to open this window, click the Source Options button in the Tool Bar, or select Options from the Source menu. The purpose of the Source Options window is to allow you to change the way the Data Parser for Unstructured Text reads your text file.

If you are familiar with the text or report file with which you are working, the Source Options window can be opened and some selections made before opening the file. Other options can be changed to meet the requirements of the report file as you are parsing it.

If you make changes to the settings in this window after opening the report file, the Data Parser for Unstructured Text may reread and reload your file. This may take a few seconds.

The window is divided into seven tabs: Extract Design Choices, Display Choices, File Properties, Printer Emulation, Character Set, Character Filters and External Viewer. The options in each tab are discussed below:

  • Extract Design Choices
  • Display Choices
  • File Properties
  • Printer Emulation
  • Character Set
  • Character Filters
  • External Viewer
  • Extract Design Choices

Extract Design Choices

This topic covers the settings under Extract Design Choices.

Tag Separator

The tag separator selected here tells the Data Parser for Unstructured Text how to distinguish a field tag from the data field when analyzing a line of text. This is only relevant when you are making use of the Parse Tagged Data shortcut menu option.

The tag separator choices are:

Tag Separator Description
ColonSpace (: ) Selecting this option tells the Data Parser for Unstructured Text to distinguish a field tag from the data field by a Colon and a Space. Example: Name: John M. Smith
Colon (:) This is the default setting. This option tells the Data Parser for Unstructured Text to distinguish a field tag from the data field by a colon only. Example: Name: John M. Smith
SpaceColon ( :) Selecting this option tells the Data Parser for Unstructured Text to distinguish a field tag from the data field by a Space and a Colon. Example: Name: John M. Smith
Dash (-) Selecting this option tells the Data Parser for Unstructured Text to distinguish a field tag from the data field by a dash. Example: Name-John M. Smith
Comma (,) Selecting this option tells the Data Parser for Unstructured Text to distinguish a field tag from the data field by a comma. Example: Name, John M. Smith
# of Spaces Selecting this option tells the Data Parser for Unstructured Text to distinguish a field tag from the data field by a specified number of spaces. When you select this option another box appears to the right of the tag separator box in which you type the desired value to specify how many spaces. Example with 3 spaces specified: Name John M. Smith
# of Spaces + Selecting this option tells the Data Parser for Unstructured Text to distinguish a field tag from the data field by two or more spaces. Example: Name John M. Smith
Vertical Bar ( | ) Selecting this option tells the Data Parser for Unstructured Text to distinguish a field tag from the data field by a vertical bar or pipe ( | ). Example: Name|John M. Smith
Other If your report file has a character that is not on the list of choices separating the field tags from the field data, you can highlight the character shown and type in any single printable character. Example: Name*John M. Smith

For details on tagged report data, see Define Data Fields.

Column Separator

The Column Separator selected here tells the Data Parser for Unstructured Text how to distinguish one column of data from the next when analyzing a line or block of text. This setting is only relevant when you are using the Parse Columnar Data or Parse Columnar w/ Heading shortcut menu options. Each column within a line of text will become a data field.

The Column Separator choices are:

Column Separator Description
(2+) Spaces This is the default setting. This option tells the Data Parser for Unstructured Text to distinguish between two columns of data by two or more spaces. Example: 10019 John M. Smith. The account number 10019 is one column, or data field, and the person's full name, John M. Smith, which contains single spaces, is another column, or data field.
(1) Space Selecting this option tells the Data Parser for Unstructured Text to distinguish between two columns of data by a single space.
Tab Selecting this option tells the Data Parser for Unstructured Text to distinguish between two columns of data by a tab.
Vertical Bar ( | ) Selecting this option tells the Data Parser for Unstructured Text to distinguish between two columns of data by a vertical bar or pipe ( | ).
Other If your report file has a character that is not on the list of choices separating columns of data, you can highlight the character shown and type in any single printable character. Example: 10019 - John M. Smith

Flush Field Contents on Accept default

The Data Parser for Unstructured Text outputs each record type as a fixed length, consistent structure record. This means that even if a field or line does not exist in a particular section of a report, that field will still exist in the output record. Propagate and Flush are opposite ways of handling these fields that contain no data or do not exist.

Propagate Field Contents is the normal default. It causes the Data Parser for Unstructured Text to carry the field contents forward from the last section that contained data in the data field when the report contains no data in the section being processed. This is the default and is particularly useful for header information that you want repeated until a new header is encountered.

Flush field contents is the opposite of propagate field contents. With the Flush Field Contents on Accept default check box checked, fields that do not exist or are blank in the report will be blank in the output by default. This only holds true for any fields defined after this setting is made. Previously defined fields will not have their Data Collection/Output options changed.

Individual fields can still be set to Propagate Field Contents on the Data Collection/Output tab of the Field Definition window.

Do Accept at End

This option is necessary when you have defined your ACCEPT Record line as Accept Before Collecting. This is usually done when there is no line that can be consistently identified as the end of a record, but there is a consistent beginning line. Since the Data Parser for Unstructured Text writes out a record as soon as it encounters the ACCEPT Record line, the very last record in a report with Accept Before Collecting will not be written out unless this option is checked. For more information, see Line Action.

The settings in the Source Option window for Do Accept at End are the default settings for future Accept records. Select the Do Accept at End Default checkbox (optional) the line styles will do an Accept at End. You can override the default settings, by selecting the Do Accept at End checkbox in the Line Action tab of the Line Style Definition window. Values can be set individually for each Accept record.

Skip First Accept

When your report does not contain a consistent last line, you may need to define the ACCEPT Record as the first line of a header or some other consistent line of text that appears at the beginning of each section of the report. This line would have the ACCEPT Record Before Collecting line action. When this line action is chosen, the resulting first record will contain no data. It will be blank. To avoid a blank first record in your output file, turn Skip First Accept ON by checking the box. See Line Action.

The setting in the Source Options window for Skip First Accept Default is the default setting for future Accept records. If you select the Skip First Accept Default checkbox, the Accept line styles you define will skip the first blank record in the output file. You can override the default settings by selecting the Skip First Accept checkbox in the Line Action tab of the Line Style Definition window. Values can be set individually for each Accept record.

Trim Leading and Trailing Spaces

When this option is checked, the Data Parser for Unstructured Text will strip your report fields of all leading and trailing spaces. This is especially useful for parsed columnar fields where you have a set field length, but have no real idea how large each piece of data will be. For details, see Shortcut Menu - Line Style Column.

Comparisons with Numbers

This setting affects how the greater than, less than, and other comparison operators in the Line Style Definition window Operator column work. With the Numeric Comparison radio button selected, any input data will be checked to be certain it is numeric and then compared as a number. With the String Comparison radio button checked, the input data would be compared as a string, regardless of what sort of data it is.

Display Choices

The settings on this tab tell the Data Parser for Unstructured Text:

  • What portion of your report file to display
  • How to display certain characters
  • Whether or not to "pad" each line of text with spaces to its maximum width
  • Whether or not to add graph lines to the display

Each of the options is described below.

Source Sample

The purpose of these settings is to allow you to select a representative sub-sample of a large text file when defining the line styles and data fields. Using a sub-sample will greatly improve the speed at which you can work.

Remember that this is only for display purposes. After you have defined your line styles and data fields in the selected sub-sample, the Data Parser for Unstructured Text still extracts the data from the entire report file. Also, the Record Browser Window previews the extract from the entire file.

When selecting a subsample for Data Parser for Unstructured Text to read, be sure that it is representative of the most complete records in the file. The available options are:

Starting Line

By default the Data Parser for Unstructured Text starts displaying the text file at the beginning of the first line in the file (line 1). However, when selecting a representative subsample of the file, you may want the Data Parser for Unstructured Text to start reading the file on some other line.

You may highlight the default value (1) and type a new value, representing the line number from which you want Data Parser for Unstructured Text to start displaying data. Use the Cursor Position Box to help you determine the line number. For example, if you want the Data Parser for Unstructured Text to start displaying a report file at line 9, rather than line 1, simply highlight the 1, type in a 9.

Ending Line

By default the Data Parser for Unstructured Text displays the first 500 lines of the text file, beginning at line 1 and ending at line 500. The Data Parser for Unstructured Text can display many more lines of text. However, the more lines displayed in the data panel the more slowly the Data Parser for Unstructured Text will function. Also, when selecting a representative sub-sample of the file, you may want the Data Parser for Unstructured Text to display a smaller subset of lines in the file.

You may highlight the default value (500) and type a new value, representing the line number at which you want the Data Parser for Unstructured Text to stop reading. For example, if you want the Data Parser for Unstructured Text to stop reading a report file at line 35, rather than line 500, simply highlight the 500, and type 35.

Click OK in the Source Options window when you are finished making your selections.

Sample Size

As you change the settings in Starting Line or Ending Line, the Sample Size value will change to reflect the total number of lines of text that the Data Parser for Unstructured Text displays. Performance will be affected as the Sample Size increases.

Padding

Pad Lines

This option allows you to choose whether the Data Parser for Unstructured Text pads each line of text in your report out to a fixed right margin, or reads each line only to the last character on that line. The default setting is OFF. You can turn it ON by clicking in the white square to the left of Pad Lines. This removes the check in the box.

When a report has a ragged right margin, this option will assist you in defining data fields to a width that does not cause the data in wider fields to be truncated. It can be misleading, however, if you are defining line styles. The spaces used to pad the lines in the display do not actually exist in the text file. Do not use them in recognition patterns for line styles.

Pad Line Length

When the Pad Lines option is turned ON, the Data Parser for Unstructured Text will pad each line of text in the selected sample out to the longest length necessary to accommodate the data. That value is displayed in Pad Length. For example, in TUTOR1.REP, the default Pad Line Length is 52.

If you want the Data Parser for Unstructured Text to pad with spaces past the last character in all the lines of text, you may change the value of Pad Line Length. Highlight the default value and type a larger value. Performance will be affected as the value increases.

Symbols

Show EndLine Symbol

This is a toggle that allows you to choose whether or not to display a symbol at the end of each line of text. The default setting is ON. You may turn it OFF by clicking on the box that contains the check mark.

Symbol

If you want the Data Parser for Unstructured Text to display an end line symbol, this is a list of symbols from which to choose. The options are a paragraph mark (¶) or {EOL}. Select the desired choice from the list box.

Show Space Symbol

This is a toggle that allows you to choose whether or not to display a symbol where each space character exists in the display. This includes both space characters that exist in the source text file and also any characters that have been added to the Data Panel by the Pad Lines option. The default setting is ON. You may turn it OFF by clicking on the box that contains the check mark.

Symbol

If you want the Data Parser for Unstructured Text to display a space symbol, this is a list of symbols from which to choose. The options are a small dot ( × ) or a period ( . ). Select the desired choice from the list box.

Show Tab Symbol

This is a toggle that allows you to choose whether or not to display a symbol where a tab character exists in the source report file. The default setting is ON. You may turn it OFF by clicking on the box that contains the check mark. The symbol will only appear if tab expansion is set to 0 on the Printer Emulation tab of the Source Options window.

Symbol

If you want the Data Parser for Unstructured Text to display a tab symbol, this is a list of symbols from which to choose. The options are a small double right angle bracket ( ) or a single right angle bracket ( > ). Select the desired choice from the list box.

Graph Paper

Show Horizontal Lines

This is a toggle that allows you to choose whether or not to display horizontal lines under each line of data in the Data Panel. The default setting is OFF. You may turn it ON by clicking on the box that contains the check mark.

Show Vertical Lines

This is a toggle that allows you to choose whether or not to display vertical lines in between each line of data in the Data Panel. The default setting is OFF. You may turn it ON by clicking on the box that contains the check mark.

Options

Display Source Option with New Extract

This tells the Data Parser for Unstructured Text to automatically show the Source Options dialog box each time a new extract script is created. The default is ON. If you do not want the Source Options dialog box to open when you create a new extract script, turn it OFF click the box containing the check mark.

File Properties

The options listed under File Properties are settings that apply globally to the current text or report file. The settings also are written to the script and are stored in the Data Parser for Unstructured Text database.

These selections should be made prior to opening the text or report file, if you know which options to choose. When you change these options after the report or text file is loaded, the Data Parser for Unstructured Text will reread and reload the report. This may take a few seconds.

The exception is the Text File choice. This will always open with the text file that was used originally to define the script. If you want to change it, you must open the script first.

Text File/URI

This is the report file that you are using, or URI to which you are connecting. If you want to check or run the script that you have designed against a different report file, you can click the down arrow and browse to the new report file. The Data Parser for Unstructured Text will load the new file and apply the script that is presently open to that file.

Line Separator

A text file is presumed to have a carriage return-line feed (CR-LF) at the end of each line. However, some files have different characters at the end of the line.

To specify some other line separator, or to specify the number of bytes of each line of a fixed length report, place the mouse pointer in the Line Separator box and click once. Then click the down arrow to the right of the box and click the desired Line Separator in the list. The list box choices are: carriage return-line feed (default), line feed, carriage return, line feed-carriage return, form feed, empty line, and number of bytes.

If you select number of bytes, enter the length of each data record in the Byte Count box that appears just below the Line Separator box. The Data Parser for Unstructured Text data display supports record lengths up to 32,000 bytes.

Caution: If your source file contains Carriage Returns, Line Feeds or Form Feeds, use the Character Filters tab to replace each of these with blanks before using the number of bytes Line Separator. For information on using character filters, see Character Filters.

If the line separator is not one of the choices from the list box and is a printable character, highlight the CR-LF and then type the correct character. For example, if the separator is a pipe ( | ), type the pipe character on the keyboard.

If the line separator is not one of the choices from the list box and is not a printable character, highlight the CR-LF and then enter a backslash ( \ ), an X, and then the hex value for the correct separator. For example: Enter \X0C to specify a form feed. See Decimal and Hexadecimal Values.

The line separator can be more than one character. If so, type the characters, or type a backslash X and the hex values for the characters.

Example 1: EOL (for the three capitol letters E, O, and L together)

Example 2: \X0C \X0D (for form feed, carriage return)

Note: Because a majority of files containing CR-LF, CR or LF line separators often include a combination of those three separators, if you select CR-LF, CR, or LF as your line separator, Data Parser for Unstructured Text uses a loose definition of line separator to include solo LFs, CRs, or CR-LFs as well. All other line separators look only for the line separator you specify.

Field Separator

The field separator selected here tells the Data Parser for Unstructured Text how to automatically distinguish fields when analyzing a line of text. The field separator applies to the entire text file on which you are working. If you need to define a data field based on Relative Field Position, select the correct field separator before defining those fields.

The field separator choices follow:

Field Separator Description
Asterisk ( * ) Selecting this option tells the Data Parser for Unstructured Text to distinguish individual words when they are separated by an asterisk.
Colon ( : ) Selecting this option tells the Data Parser for Unstructured Text to distinguish individual words when they are separated by a colon.
Comma ( , ) Selecting this option tells Data Parser for Unstructured Text to distinguish individual words when they are separated by a comma.
Dash ( - ) Selecting this option tells the Data Parser for Unstructured Text to distinguish individual fields when they are separated by a dash.
Caret ( ^ ) Selecting this option tells the Data Parser for Unstructured Text to distinguish individual fields when they are separated by a caret.
Semicolon ( ; ) Selecting this option tells the Data Parser for Unstructured Text to distinguish individual fields when a semicolon separates them.
Slash ( / ) Selecting this option tells the Data Parser for Unstructured Text to distinguish individual fields when they are separated by a slash.
Space ( ) A space is the default setting. This option tells the Data Parser for Unstructured Text to distinguish individual fields when they are separated by a single space (hex 20).
Tab Selecting this option tells the Data Parser for Unstructured Text to distinguish individual fields when a tab (hex 09) separates them.
Tilde ( ~) Selecting this option tells the Data Parser for Unstructured Text to distinguish individual words when they are separated by a tilde.
Vertical Bar ( | ) Selecting this option tells the Data Parser for Unstructured Text to distinguish individual fields when a vertical bar separates them.
Other If the word separator in your report is not on the list of selections, you can highlight the current value and type in any single character. If the character cannot be typed, you can enter the hex value for that character with \X in front of it. For example: \X0A will enter a Line Feed as the word separator.

Printer Emulation

The options listed under Printer Emulation are settings that apply globally to the current text or report file. The settings also are written to the script and are stored in the Data Parser for Unstructured Text database.

These selections should be made prior to opening the text or report file, if you know which options to choose. When you change these options after the report or text file is loaded, the Data Parser for Unstructured Text will reread and reload the report. This may take a few seconds.

Printer Emulation

Printer Emulation gives the Data Parser for Unstructured Text the ability to read print control characters. This allows the Data Parser for Unstructured Text to display the report with the same appearance it would have had if printed.

None is the default setting. The Data Parser for Unstructured Text reads and interprets the ANSI characters between and including 32 - 126.

Printers that the Data Parser for Unstructured Text can presently emulate include AS400 and 1403. Other printers will be added to the list in future releases.

Tab Expansion

If your text file has embedded tab characters representing white space, you can expand those tabs to a set number of spaces. The default value is eight (8). To change the value, highlight the default and then type in the desired value.

If the Tab Expansion is set to 0, then tab characters in the source file will not be replaced with spaces. The tab characters, themselves, can then be used as field separators, column separators, begin or end tags, or other uses. If Show Tab Symbol is selected in the Display Choices Tab of the Source Options window, a symbol will display everywhere a tab character exists in the source text or report file.

Character Set

The options listed under Character Set are settings that apply globally to the current text or report file. The settings also are written to the script and are stored in the Data Parser for Unstructured Text database.

These selections should be made before opening the text or report file, if you know which options to choose. When you change these options after the report or text file is loaded, the Data Parser for Unstructured Text will reread and reload the report. This may take a few seconds.

Code Page

Select the Code Page Translation Table that the Data Parser for Unstructured Text will use when reading an EBCDIC report file.

Tip: If you change ANSI to EBCDIC and then back to ANSI, the data window appear to lose data, but this is a display issue resulting from use of different default line separators in the two character sets. To return the display to readable characters, in the File Properties tab pane, change the line separator CR-LF to CR, click OK, then reopen the settings and change the line separator back to CR-LF.

Character Filters

The options listed under Character Filters are settings that apply globally to the current text or report file. The settings also are written to the script and are stored in the Data Parser for Unstructured Text database.

These selections should be made prior to opening the text or report file, if you know which options to choose. When you change these options after the report or text file is loaded, the Data Parser for Unstructured Text will reread and reload the report. This may take a few seconds.

You can choose individual characters to filter by clicking in the cell next to that character and selecting delete, leave alone or replace with space from the list.

Note: The character with decimal value 159 affects the line display. Filter this character while designing your script.

Reset Defaults

If you have selected some characters to filter and want to set those characters back to the original settings, click Reset Defaults.

Note: All but two characters default to leave alone. 00 (NUL) characters default to replace with space and 0C (FF) characters default to delete.

Filter Non-Print

If you want to have all nonprintable characters other than carriage return-line feed replaced with spaces in the display, click the Filter Non-Print button. Filtering can be useful if you have unusual characters that interfere with reading the file correctly or with the ability of the script to recognize certain lines.


When you have finished making your selections on the tabs in the Source Options window, click the OK button to save your selections.

How to Create a Report File

In order to read a report into the Data Extractor, it must exist as a disk file on your PC or network drive to which you have access/permissions. The following section includes suggestions for accomplishing this.

Reports on a Mainframe

If the report was generated on a mainframe, it must be spooled to a file and downloaded to your PC or network using a communications package. The report may be in ASCII or any standard EBCDIC format.

Reports on a PC

If the report was generated in a PC application, you may do either of the following:

  • Print the report to a regular paper printer, then scan and OCR it to a disk file.
  • Print the report directly to a disk file from the application by following the steps below to set up a special print to file driver on your PC.

To set up the Generic/Text Printer

  1. In Windows XP/2000/2003, open the Printers folder.
  2. Double-click the Add Printer option to start the Add Printer Wizard.
  3. In the first Add Printer Wizard dialog box, click Next.
  4. Select Local Printer and click Next.
  5. From the list of manufacturers, select Generic. The Generic/Text Only entry displays in the Printers box on the right and is highlighted. Click Next.
  6. From the list of available ports, select File and click Next.
  7. In the next dialog box, the Printer Name displays as Generic/Text.
  8. Select whether or not you want this printer to be the default printer in your Windows applications, then click Next.
  9. The next dialog asks if you want to print a test page. Select No and click Finish. Windows prompts you for the Windows installation disks or CD from which it builds the printer driver.
  10. Follow the instructions during the setup of the printer driver.

You can now print to a disk file from any Windows application by selecting the Generic/Text printer.

Tips for Creating the Report in the Application (not requirements)

  • Select a fixed font, such as Courier, so the information in the report is positioned consistently.
  • Use field tags (field names), or create a columnar report with headings, to identify the data more easily in the Data Extractor.

URI Support

The Data Extractor offers dramatic source flexibility using URI support. The full addressing scheme must be present before the Data Extractor can connect to the source file.

Note: A minimum of v1.4.0 of Sun Java Runtime Environment (JRE) must be installed for the URI to function properly. Without this component, you will receive the error message: "Unable to load Java virtual machine."

For more information on URIs within the Data Extractor, see Dealing with URI Limitations below.

What Are URIs?

URIs (Uniform Resource Identifiers) are a modern subset of URLs, that were originally nothing more than HTTP and FTP locations on the Internet. A URI typically describes the following:

  • The mechanism used to access the resource
  • The specific computer in which the resource is housed
  • The specific name of the resource (a file name) on the computer

For example, this URI...

http://www.yahoo.com

...identifies a file that can be accessed using the Web protocol application, Hypertext Transfer Protocol ("http://"), that is housed on a computer named "www.yahoo.com" (which can be mapped to a unique Internet address). In the computer's directory structure, the file is located at "/support.asp".

File Transfer Protocol addresses and e-mail addresses are also URIs (and are also called a URL).

The common syntax for scheme-specific data is:

file://:@:/

Some or all of the parts ":@", ":", ":", and "/" may be excluded. The scheme specific data starts with a double slash ( // ) to indicate that it complies with the common Internet scheme syntax. The different components obey different rules, as follows:

Component Description
user An optional user name. Some schemes (e.g., ftp) allow the specification of a user name.
password An optional password. If present, it follows the user name separated from it by a colon. The user name and password, if present, are followed by a commercial "at" sign ( @ ). Within the user and password field, any colon ( : ), "at" sign ( @ ), or slash ( / ) must be encoded.
host The fully qualified domain name of a network host, or its IP address as a set of four decimal digit groups separated by periods ( . ).
port The port number to connect. Most schemes designate protocols that have a default port number. Another port number may be supplied, in decimal, separated from the host by a colon. If the port is omitted, the colon is as well.
url-path The remainder of the locator consists of data specific to the scheme, and is known as the "url-path". It supplies the details of how the specified resource can be accessed. Note: The slash ( / ) between the host (or port) and the url-path is not part of the url-path.

Dealing with URI Limitations

URI support is currently limited to "public" WWW sites, where the navigation to the page containing web content you want to extract is directly addressable with one URL hyperlink (e.g., cnn.com). You cannot connect to interactive session resources like TELNET or RLOGIN.

While this URI support is a huge benefit, and reaches hundreds of millions of pages, there are still many URLs that are not directly addressable. You must first go through some painstaking navigation, authentication, looping, etc., before you arrive at the page containing the data to be extracted.

Solutions for overcoming this limitation at design time include:

  • Manual Human navigation to the desired web page (entering the requisite passwords, etc.), and then doing a Save As to a local file. That file could then be brought in and processed using Data Extractor.
  • Automated Human navigation to the desired web page (entering the requisite passwords, etc.), and then doing a SaveAs to a local file. That file could then be brought in and processed using Data Extractor.

Pervasive Data Inspectors Online Help - Table of Contents

Pervasive DataTools

DataTools Data Inspectors User’s Guide



Pervasive Software Inc.

12365 Riata Trace Parkway

Building B

Austin, TX 78727 USA

Web: http://pervasivedatatools.com



About This Manual

This manual leads you through the operation of the Data Inspectors user interface. The Pervasive Data Inspectors allow you to analyze data in dozens of databases, applications, or file formats. You must have a non-expired Data Inspector license to run each product.

Refer to the license.txt file in the default installation directory for disclaimers and information about trademarks and credits.

Table of Contents

Important Installation Instructions - Data Inspectors

The installation and operation of the Pervasive Data Inspectors requires the installation of Pervasive DataTools v9. After you order a trial or a license subscription for one or more of the Data Inspectors, follow the steps below, and in the order written, to install DataTools v9 and Data Inspector.

1. Pre-Installation Requirements

Follow ALL steps in the System Requirements and Installation Pre-requisites document. Do not install DataTools until/unless all of the specified requirements and pre-requisites are performed and/or verified.

2. Download Pervasive DataTools v9

The DataTools v9 installation EXE file can be downloaded from the Product Downloads page of the DataTools web site.

Download the installation file to your workstation's desktop.

3. Install Pervasive DataTools v9

Click on the DataTools v9 installation EXE file to start the installation wizard. Follow the prompts until the installation is finished. The installation may take several minutes.

4. Launch Pervasive DataTools v9

Launch Pervasive DataTools v9. When prompted, enter your login credentials that you used on the Pervasive DataTools web site to order your DataTools products.

This step is a good way to verify that you have installed DataTools v9 properly before installing Data Inspectors. If DataTools v9 will not start and run, do not proceed to the Data Inspector installation until the DataTools v9 installation is completed and DataTools v9 runs properly on your workstation.

5. Download Data Inspector

The DataTools v9 installation EXE file can be downloaded from the Product Downloads page of the DataTools web site.

Download the installation file to your workstation's desktop.

6. Install Data Inspector

Click on the Data Inspector installation EXE file to start the installation wizard. Follow the prompts until the installation is finished. The installation may take several minutes.

7. Launch DataTools v9

After installing Data Inspector, launch DataTools v9. The DataTools Launcher should find your Data Inspector license file via our web service and prompt you to download the file to your workstation.

8. Launch Data Inspector

Once the Data Inspector license file has been downloaded to your workstation, highlight the Data Inspector product in the DataTools Launcher's grid and then click the Launch button located under the grid. The Data Inspector should launch.

Introduction

The Data Inspectors are simple data quality analysis tools. Each Data Inspector provides an easy-to-use method of analyzing a data source.

Functionality

Data Inspectors perform the following actions:

  • Analyzes a data set accurately and efficiently
  • Generates a report and a graph of the analysis of your data
  • Using the report, you can isolate irregularities, quantify business rule violations, and identify null sets or values
  • Provides connectivity for many data sources including databases, applications, and data file formats

Features

Each Data Inspector includes the following features:

Simple three-step workflow:

  1. Connect to the source data.
  2. Select one or more fields/columns to analyze.
  3. Click the Run button.

Predefined metrics return the results in tabular and graphic formats.

Design Preferences

The Data Inspector Design Preferences dialog box enables you to change some of the design time preferences for the following items:

  • AutoGen Defaults
  • Fonts
  • Logging

Autogen Defaults

The Autogen Defaults tab enables you to customize the Row Sample Size and Distinct Value Field limit.

By default, Row Sample Size is set to 10000 and Distinct Value Field limit is set to 15. This default set up is to limit the processing time and memory required to set up and run an analysis.

Fonts

The Fonts tab enables you to customize the fonts used in the Data Inspector user interface.

Logging

The Logging tab enables you to rename the log file name. By default, the log file name is Data_Investigator.log.

Source Connection Tab Overview

Data Inspectors can analyze data from many sources. Each Data Inspector supports a specific database, application, or data file format. Before you can analyze data from a particular source, you must identify the source connection type and connect to a database table, application entity, or data file.

To connect to the source data, you must specify the following:

  • Source connection
  • Data location
  • Data format

The next few pages of this manual provide details for each.

Source Connection

The only source connections that are available in the Data Inspectors are those whose structure is pre-defined by the existence of complete metadata. In other words, if you wish to analyze the data from a Microsoft SQL Server database table, there is a Data Inspector available. However, if you wish to analyze data from a report file or from a fixed ASCII, Btrieve or C-ISAM data file, you must first use one of the Pervasive Data Parsers to parse, unpack, and export the data to a CSV text file format.

Before you design your data analysis specification, you must connect to a data source. The following sections will guide you through the steps and the available options.

  • Source Connection Tab Overview
  • Select the Data Source Connection Type
  • Connecting to the Source Data
  • Managing the Source Connection

Select the Data Source Connection Type

  1. The Source Connection tab opens upon launching the Data Inspector.
  2. The Select Connection dialog box displays the following three tabs:

    • Most Recently Used (Default) - Lists the connections that have been recently used.
    • Factory Connections - Lists the connections supplied with Data Inspector.
    • User Defined Connections - Lists connections that you have saved.
  3. Click a tab to select your connection type.
  4. The selected connection tab provides a list of available connections in the specific Data Inspector for which you have a license.

  5. Select your applicable source connection from the list.
  6. Click OK to apply the selected source connection.
  7. Alternatively, double-click the correct source connection to apply the current source connection. Data Inspector returns to the previous screen with the Source Connection tab open and the correct source connection listed in the Source Connection box.

Connecting to the Source Data

The fields that are displayed depend on the type of connection you select. However, regardless of the connection type, you should always see the following fields:

  • Connector - Lists the connectors you chose for this connection. This field cannot be edited.
  • Structured Schema (Visible, but disabled) - The Structured Schema field is visible but disabled in the Data Inspectors.
  • Source File/URI - Defines the data source location.

The label may vary based on the connection name; however, it is usually labeled Source File/URI.

Click the drop-down list to navigate to the source file.

To Connect to the Source Data

  1. Click the Source File/URI arrow and navigate to the file location.
  2. Data Inspector enables you to provide the source file / URI in any of the following ways:

    • Select a data source using the Source Connection arrow, and click the Open button (or double-click the source).
    • You return to the previous screen with the Source Connection tab open and the correct source in the Source Connection text entry box.
    • Enter the source location directly into the text box, but if you do, you must click the Connect button or press the Enter key for Data Inspector to connect to this source.
    • If the source file you want to use is not accessible by directory search (that is, if it is at a remote connection accessible over a network), then specify a uniform resource identifier (URI) in the Source File/URI box either by navigating to it and selecting it in Source File/URI or by inputting it directly into the Source File/URI box.
  3. Optional: Enter the appropriate values into any other active text entry boxes (such as UserID and Password if they are available on the Source Connection tab.).
  4. The steps required for these tasks varies according to the combination of source connection, data source, and profiling needs.

  5. Optional: If you want to customize the connector properties, change the property fields, and click Apply.
  6. For more information on editing the connection properties, see Managing the Source Connection.

  7. Click the Connect button if it is active.

Managing the Source Connection

In addition to the connect information on the left side of the Source Connection tab, there is a list of connector properties in the right panel. Not all files with the same connection have all of the same connection properties, and it may be easier to write a profiling specification if some of the properties are changed.

For instance, some ASCII-delimited files have a header row giving field names, while others do not. The default properties for the ASCII (Delimited) connector assume that there is no header row. If the file does have a field name header, changing the Header property so that Data Inspector will pick field names up makes defining a profiling specification easier. For the case of the Header property, double-clicking False changes the value to True. This lets Data Inspector read the field names from the header, so it uses actual field names (Name, Address, and so on) instead of the default field names (Field1, Field2, and so on) in setting up profiling metrics.

To Verify a Connection Property

  1. Click the Source Data Browser icon (magnifying glass) in the task bar, or select View > Source Data. This opens the Source Data Browser in which you can check the data formatting.
  2. Look at a sampling of the records to ensure all the fields are properly parsed.
  • If they are, go on to To Save a Source Connection.
  • If they are not, pick another connection or change the connection properties.

To Change a Connection Property

  1. Click in the Value field next to the Property name.
  2. For some properties, a drop-down list arrow appears in the value cell. Other properties may require you to specify a file or enter a text value.

  3. Indicate the desired value. Do one of the following:
    • If there is a drop-down list, open the list and click the desired value.
    • If there is no drop-down list, enter the desired value directly in the cell.
    • If the connector property is a boolean value, you can also double-click the value to toggle it to the other state.
    • Click Apply to save your selections, or click Restore to stay with the original value.

NOTE: After making a change to one or more of the connection property options, if you do not click the Apply button, the settings revert to the previous value before your change was made.

To Save a Source Connection

  1. Click the Save Source Connection icon to the right of the Source Connection field. The Save Source Connection Document As window opens.
  2. Name the edited connection file at the File name text box and give it a location at the Save in browser box. The file is saved with an .sc.xml extension.

You can reuse the saved connection by selecting it in the User Defined Connection window.

Select Data Fields and Records to Analyze

  1. Connect to the data source as described in Source Connection.
  2. Click on the Results tab.
  3. Click on the Source Data Sample button to open the list of data fields/columns in the data source.
  4. Click on the checkbox to the left of one or more data fields.
  5. Click on the OK button to close the dialog box.

By default, the field list will offer only the first 15 data fields from your data source.

To change the number of visible fields, follow these steps:

  1. Click on the View menu.
  2. In the View menu, click on Preferences.
  3. In the Preferences dialog box, click on the AutoGen Defaults tab.
  4. Highlight the default value (15) in the Distinct Value Field Limit box and enter a value that will display the data field(s) in the filter grid on which you wish to run the analysis.
  5. Click the OK button to close the Preferences dialog box.

The Data Inspector will run an analysis on the selected data field(s) and will return a table and a pie chart of Distinct Values on the field or fields.

By default, the Data Inspector will run the Distinct Values analysis on the selected fields of the first 10,000 data records.

To change the number of data records on which to run the analysis, follow these steps:

  1. Click on the Source Data Sample button in the button bar.
  2. In the right portion of the Data Filters dialog box, highlight the default value in the "end with:" box to a value that suits your analysis requirements.
  3. Click the OK button to close the Data Filters dialog box.

Note: The Distinct Values analysis requires significant amounts of processing. The selection of a large number of data fields in a large data set may significantly increase the time to completion.

After selecting the desired data fields and data records on which to run the Distinct Values analysis, click on the Run Profile button in the button bar.

Run the Analysis

After connecting to the data source and selecting the desired data fields and data records on which to run the Distinct Values analysis, click on the Run Profile button in the button bar.

Depending on the number of data fields and data records you chose to analyze, the analysis will take a few seconds to a few minutes to complete. Note: The Distinct Values analysis requires significant amounts of processing. The selection of a large number of data fields in a large data set may significantly increase the time to completion.

When the analysis is complete, a table and a pie chart showing the results will display on the Results tab.

View the Results

After the analysis is complete, the Results tab will display two tables (on the left) and a pie chart (on the right). The following explains how to read the results.

Table Results

Upper Table

The information displayed in the upper table of the Results tab is the result of the Distinct Value analysis of the selected data source.

You will see at least two rows of information listed in the upper table. The first row contains the number of records that were analyzed in the data source. The second row contains the field name of one of the first source data field that was analyzed and the Distinct Count for that field.

  • If you chose only one data field, there will be no additional rows in the upper table.
  • If you chose multiple data fields, there will be one row in the upper table for each field.

The Distinct Count value reflects the number of distinct values that occur in the data records for each selected data field.

Note: Both the "Field / Column" and "Distinct Count" columns can be sorted in ascending and descending order by clicking on the yellow column heading in the upper table.

Lower Table

The lower table of the Results tab displays a list of the Distinct Values for each data field as you click on each row in the upper table.

Note: Each of the three columns can be sorted in ascending and descending order by clicking on the yellow column heading in the lower table.

Pie Charts

Along with the tables describing the results, a pie chart is also displayed on the Results tab. This pie chart provides a visual representation of the statistics as you click each row in the upper table.

Note: If the Distinct Values results display as very narrow sections in the pie chart, you may want to maximize the size of the Data Inspector window for better viewing. The splitter bar between the tables and the pie chart can also be adjusted left and right to re-size each side of the Results tab.

Clear the Analysis

After running one analysis, if you want to "start over" and run another analysis, follow these steps:

  1. On the Results tab, click on the yellow "Field/Column" column heading to sort the upper table by the field names.
  2. Once sorted, the _Record_Count row should be at the top of the list. Click once on that row so no pie chart displays in the right side of the Results tab.
  3. Click on the Clear Results button in the button bar, and verify that you want to close the current analysis without saving it.
    • If you want to save the current analysis, click Yes when prompted, and save the current analysis.
    • If you do not want to save the current analysis, click No when prompted.

Save the Analysis

After designing and running a Distinct Values analysis in the Data Inspector, you may save the analysis by following these steps:

  1. Click on File in the menu bar.
  2. In the Windows common dialog box, navigate to the drive and folder where you want to save the analysis.
  3. Enter a name for the analysis script that will be saved. Do not overwrite the default file extension of .dp.xml.
  4. Click the Save button to save the analysis script.

To open a saved analysis, follow these steps:

  1. Click on File in the menu bar.
  2. In the Windows common dialog box, navigate to the drive and folder where your analysis script is saved.
  3. Select the desired script file with a .dp.xml file extension.

Pervasive Data Joiners Online Help - Table of Contents

Pervasive DataTools

Data Joiner User’s Guide



Pervasive Software Inc.

12365 Riata Trace Parkway

Building B

Austin, TX 78727 USA

Web: http://pervasivedatatools.com



About This Manual

This manual leads you through the operation of the Data Joiner user interface. The Pervasive Data Joiner allows you to join two heterogeneous data sets and export the joined results set to a CSV Text file. You must have a non-expired Data Joiner license to run this application.

Refer to the license.txt file in the default installation directory for disclaimers and information about trademarks and credits.


Table of Contents


Getting Started with Data Joiner

Source Connections

Join Settings

Exporting the Joined Data

Browsing the Data

Saving and Reusing Connections

Using the Query Builder

Saving and Reusing Join Designs

Appendix

Introduction to Data Joiner

Data Joiner is an application used to join heterogeneous data. It allows you to do a full (or outer) join, an inner join, a left join or a right join. The Data Joiner user interface consists of three tabs for specifying source information and a run button for exporting the joined result set to an ASCII or Unicode delimited text file.

  • Source1 Connection
  • Source2 Connection
  • Join Settings

This application allows you to join only two source data sets. However, each source can be data from a file, a table, a view, or a SQL statement. When a source data set is described using a view or a SQL statement, the view or SQL statement itself could be joining multiple tables from the same database.

To successfully export joined data, you will first have to connect to Source 1, connect to Source 2, specify the keys to join on, and which fields from each source you want in the resulting data set. The Data Joiner allows you to browse each of your source data sets, test the join results and browse the exported result set.

Preferences

You can access Join Designer preferences by selecting the View > Preference menu item. The preferences dialog allows you to set the application font and set a description for a join before you save the join design.

Join Designer Preferences - Fonts


Join Designer Preferences - Join Design Properties

What do you want to do?

For each task, we have provided a quick link to the section of the Data Joiner documentation that describes how to perform the specific task.

Task Help Topic
Define a Join Source Go To Source 1 Connection
Browse Source Data Go To Browsing Source Data
Set Join Properties Go To Join Settings
Test a Join Design Go To Testing the Join Design
Export Joined Data Go To Exporting Joined Data
Browse Exported Data Go To Browsing Exported Data
Build a Query Statement Go To Using the Query Builder
Create a Reusable Source Connection Go To Creating a ReUsable Source Connection
Save a Join Design Go To Saving the Join Design

Source 1 Connection (Tab 1)

The first Data Joiner tab is for specifying one of your two sources. You can think of it as the left side of a join.

Source1 Connection Tab

Choose a source connector type by clicking on the down arrow to the right of the Source1 Connection textbox. The second tab of the connection selection dialog that pops up will display a list of Factory Connections. Once you have created a couple of designs, the first tab will display connections that were recently selected or used. The third tab of this dialog allows you to choose previously created User Defined Connections. See Also: Creating ReUsable Source Connections.

Select a Factory Connection

Once you have selected a connector type, you’ll need to fill in the remaining source connection information. The connection information fields on Source1 Connection tab will vary depending on the type of connector selected. If your source connector is a SQL type source, you have the options of specifying a database table, an SQL statement, or an SQL file which is a file containing an SQL statement. For information on using the Query Builder to create an SQL statement, see Using the Query Builder.

You may also need to change the default connector properties in the grid to the right based on the particular source data you are trying to connect to. After changing any of the connector properties, press the Apply button to apply those changes.

After you enter enough information about your source connection, the application may automatically connect to the source. If it does not, press the connect button. You will know you are connected to the source data when the connect button is disabled and the connect arrow on the Source1 Connection turns solid.

Source1 Connection Completed

Source 2 Connection (Tab 2)

Data Joiner tab two is used to specify the connection information for your second source. You can think of this as the right side of a join. Select a connector type and fill in all appropriate information that allows a successful connection to the data.

Refer to Source 1 Connection (Tab 1) for details.

Join Settings (Tab 3)

Data Joiner tab three is used to specify additional join information about both sources. In particular, the join engine needs to know if your source data is already sorted. If it is not, the join engine will sort the data first. This will add additional processing time, especially for large data sets.

You can specify which fields in both sources you want in your joined results data set. You do not have to include all fields in the results. If source 2 contains a field with the same name as a source 1 field, the application will automatically enter an alias name for that field. You can change the alias name. The result set will contain the source 1 field name and the alias name for the source 2 field to avoid duplicate field names in the joined result set.

Finally, you will need to specify the key(s) for both sources. The keys will be used to join the data on the specified key fields.

Join Settings Tab

Browsing Source Data

Once you have successfully connected to a source data set, you can browse the data by selecting the View->Source1 Data or View->Source2 Data menu items or by clicking on the browse toolbar button drop down and selecting Browse Source1 Data or Browse Source2 Data. These menu items will only be enabled when a source connection has been made.

Browse Button
Browse Data Window

Testing the Join Design

At any time in the join design process, you can validate the join by pressing the validate toolbar button to see if you have included enough information to successfully join the two source data sets.

Validate Button

Upon successful validation of a join design, you can test the join by selecting the View>Joined Data (Test) file menu item or the browse toolbar>Browse Joined Data (Test) submenu item. Testing the join will do the following:

  • Save various join design files in a temporary directory
  • Join the data
  • Display the joined data in the data browser

Depending on the size of the source data sets as well as whether or not the source data sets need to be sorted, this could take some time.

Exporting Joined Data

Click the Export Joined Data toolbar button or select the File>Export Joined Data menu option to export the joined result set.

Export Joined Data Button

An export dialog will be displayed allowing you to choose the export target connector type and file location as well as set any target connector properties. In particular, if you want the first record in your target file to be a list of the result set field names, you’ll want to set the Header property to True. You can also choose to export a subset of the joined record set either by choosing to export a sample number of joined records or by setting an export filter for exporting only certain joined records.

Export Options


When you have finished entering target export information, press the Export Joined Data button to export the data. A run status dialog will be presented. Depending on your source data set size and whether or not they are sorted, running the join and exporting the data can take some time.

Browsing the Exported Data

Once the export is complete, you can browse the exported data file by selecting the View->Exported Data menu item or the Browse toolbar button > Browse Exported Data sub menu.

Browse Button

Browsing the Log File

Exporting joined data performs a data transformation from the joined source to the export target. Statistics about this transformation will be logged to a file. Statistics include number of records read and number of records written. It will also include any errors that may have prevented the transformation from occurring. You can view the log file by selecting the View->Log File (DataJoiner.log) menu item or by clicking on the browse toolbar button and selecting the Browse Export Log File (Data Joiner.log) sub menu item.

Browse Button

Creating Reusable Source Connections

If you find yourself using the same source in multiple join designs, you can enter the source connection information once, create a user defined connection, then select that user defined connection in future designs.

To create a user defined connection out of Source 1, fill in all the appropriate connection information on the first tab. Set whatever properties are needed for the connection as well. After browsing the source data to verify that the connection information is correct, press the save toolbar button that’s to the right of the source connection textbox (top right).

Save Connection Button

A dialog will appear asking you to name the user defined source connection. Connection files have a .sc.xml extension.

Once you have saved a user defined connection, you can use it in any future Join Design by clicking on the down arrow to the right of the Source1 Connection textbox, going to the third tab of the Select Connection dialog and choosing your saved connection. Tab 1 will automatically be populated with all connection information that was stored as a part of your user defined connection.

User Defined Source Connections

Using the Query Builder

When creating a SQL statement to use for a source connection, you can use the Query Builder. You can access the Query Builder by selecting the Query Statement option and pressing the Build Query button.

Build Query Button


Query Builder Window

For documentation on how to use the Query Builder, see Appendix A.

Saving the Join Design

Click the Save Join toolbar button or select the File->Save menu item to save a join design.

Save Join Button

Saving a join design saves a number of files needed by the join engine. Your saved join will consist of a JoinName.jd.xml file, JoinName_Source1.sv.xml and .svf.xml files, and JoinName_Source2.sv.xml and .svf.xml files. All of these will be saved a default workspace and repository created by Data Joiner.

Appendix A – Integration Query Builder

The Integration Query Builder consists of following five panes:

  • The upper-right Connections pane, where you set up database connections
  • The lower-right DB Browser pane, which displays the tables and views available for use in the currently active database
  • The upper-left Relations pane, in which relationships between database objects are defined
  • The center-left Columns pane, in which you select fields and define criteria, as well as grouping and sorting
  • The lower-left SQL pane, in which the finalized SQL command is displayed and more complicated editing can be done

Connecting to the Database

  1. Select New Database Connection from the Database menu.
  2. Enter a unique name for this connection in the Connection Name box.
  3. Select the Data Access Type (see Table 1 below).
Table 1. Data Access Types
OLE DB Dynamic Data Access: ADO
If you select OLE DB Dynamic Data Access: ADO, the standard Windows Data Link Properties dialog box displays. Select the OLE DB provider you want to use from the list available. After selecting the appropriate OLE DB Provider, click the Next button to access the Connection tab to enter more details about the connection. The information required depends upon the provider selected. Click Test Connection to ensure you have entered the correct details.
OLE DB UDL link-file: ADO
If you already have the required connection information stored using a ".udl" file (Universal Data Link), select this type and navigate to the file using the Select Data Link File dialog box.
ODBC Machine Data Source: DSN or ODBC File Data Source: DAO
These ODBC data access options can be used when the data source does not support an adequate OLE DB Provider. They enable you to connect with data sources that have file data source names (DSNs).
In general, it is recommended that you use the previously described ADO options rather than these.

In all cases when you have specified a valid connection and entered the required information, a new connection node is added in the upper-right Connections pane and you can to click this in future to open this connection.

Integration Querybuilder SQL Functions

The following SQL functions are supported in the Integration Querybuilder:

Concatenate Functions

Highlight multiple rows in the Columns pane, and select the appropriate function from the menu. The selected fields are converted to string data type and concatenated as defined in the following table.>p>

Table 2. Concatenate and Formatting Descriptions
Fld1+", "+...+", "+FldN
Creates a string by concatenating all selected fields and inserting a comma followed by a space between them. For example: "Name1, Name2, Name3"
Fld1+","+...+","+FldN
Creates a string by concatenating all selected fields and inserting a comma between them. For example: "Name1,Name2,Name3"
Fld1+" "+...+" "+FldN
Creates a string by concatenating all selected fields and inserting a space between them. For example: "Name1 Name2 Name3"
Fld1+...+FldN
Creates a string by concatenating all selected fields without separating them. For example: "Name1Name2Name3"
Fld1+"-"+...+"-"+FldN
Creates a string by concatenating all selected fields and inserting a hyphen between them. For example: "Name1-Name2-Name3"
Fld1+" - "+...+" - "+FldN
Creates a string by concatenating all selected fields and inserting a space followed by a hyphen followed by space between them. For example: "Name1 - Name2 - Name3"
Fld1+", "+Fld2+" "+Fld3
Creates a string by concatenating the three selected files and inserting a comma and a space between the first and second fields and a space between the second and third. Used for last name, first name and middle name (or initial) format. For example: "Name1, Name2 Name3"
Fld1+" "+Fld2+". "+Fld3
Creates a string by concatenating the three selected files and inserting a space between the first and second fields and a period followed by a space between the second and third. Used for first name, middle initial, last name format. For example: "Name1 N. Name3"

Mask Functions

Highlight a row from the Columns pane and select the appropriate function from the menu. The selected field is converted to string data type and formatted as described below:

Table 3. Masks and Formatting Descriptions
Phone [LO] : 1-999-999-9999
Converts the first 11 characters to a string using this phone number format.
Phone [LO] : 1 999-999-9999
Converts the first 11 characters to a string using this phone number format.
Phone [AC] : (999) 999-9999
Converts the first 10 characters to a string using this phone number format.
Phone [AC] : 999-999-9999
Converts the first 10 characters to a string using this phone number format.
Phone [AC] : 999.999.9999
Converts the first 10 characters to a string using this phone number format.
Phone [ACX]: (999) 999-9999 (ext 00000)
Converts the first 15 characters to a string using this phone number plus extension format.
Phone [SH] : 999-9999
Converts the first 7 characters to a string using this phone number format.
Area Code : (999)
Converts the first 3 characters to a string using this area code format.
Extension : (ext 00000)
Converts the first 5 characters to a string using this extension number format.
Extension : -X00000
Converts the first 5 characters to a string using this extension number format.
SS# : 999-99-9999
Converts the first 10 characters to a string using this social security number format.
Fed ID : 99-9999999
Converts the first 9 characters to a string using this federal ID number format.
Zip Code : 99999-9999
Converts the first 9 characters to a string using this zip code format.
Bank Acc : 99999-99999
Converts the first 10 characters to a string using this bank account number format.
Cred Card : 9999-9999-9999-9999
Converts the first 16 characters to a string using this credit card number format.
ISBN : 9-99-999999-9
Converts the first 10 characters to a string using this ISBN format.
ISBN : 9-999-99999-9
Converts the first 10 characters to a string using this ISBN format.
ISBN : 9-9999-9999-9
Converts the first 10 characters to a string using this ISBN format.
ISBN : 9-99999-999-9
Converts the first 10 characters to a string using this ISBN format.
ISBN : 9-999999-99-9
Converts the first 10 characters to a string using this ISBN format.

Aggregate Functions

Aggregate functions return a single value based on a set of other values, such as the sum of these values, including the maximum and minimum. The aggregate function can be applied to all the values in a particular field in the query, or to groups of these values by applying the Group By clause.

The aggregation functions available depend on the SQL variant and syntax used by the data source. For a full description of each function, refer to the vendor's documentation.

Other than COUNT, most aggregation functions are designed to work with numeric fields and unexpected results can occur if you try to aggregate strings.

String Functions

String functions operate on single text strings and are used to obtain and manipulate characters or character information, such as extracting a substring of one or more characters or trimming blank spaces from a field.

The functions available from this menu depend on the SQL variant and syntax used by the data source. Refer to the vendor's help or documentation for a full description of these functions.

Numeric Functions

Numeric functions operate on single numeric fields and are used to obtain simple mathematical information, such as the sign of a number, the next highest integer or its absolute value.

The functions available from this menu depend on the SQL variant and syntax used by the data source. Refer to the vendor's help or documentation for a full description of these functions.

Date Part Functions

Date Part functions operate on date fields and are used to extract and convert a portion of a date into a character string such as day of week, day of year, quarter, hour and minute.

The functions available from this menu depend on the SQL variant and syntax used by the data source. Those shown may only be the most commonly used of those available. Refer to the vendor's help or documentation that was provided for a full description of these functions.

Date Name Functions

Date Name functions operate on date fields and are used to extract and convert a portion of a date into a named date element such as month name or weekday name.

The functions available from this menu depend on the SQL variant and syntax used by the data source. Refer to the vendor's help or documentation for a full description of these functions.

Conversion to Character Functions

Conversion to Character functions convert numeric fields to text fields, then apply the selected formatting to enable different formats for currency and dates and times.

The functions available from this menu depend on the SQL variant and syntax used by the data source. Refer to the vendor's help or documentation for a full description of these functions.

Conversion to Number Functions

Conversion to Number functions convert numbers held as text strings to their numeric values. Strings containing non-numeric values will be converted to 0.

The functions available from this menu depend on the SQL variant and syntax used by the data source. Refer to the vendor's help or documentation for a full description of these functions.

Conversion to Date Functions

Conversion to Date functions convert numeric fields to their date equivalent in the selected format.

The functions available from this menu depend on the SQL variant and syntax used by the data source. Refer to the vendor's help or documentation for a full description of these functions.

NULL Case

This button inserts code that will deal with a field containing a NULL value. Different SQL data sources will handle this differently, but should allow you to substitute some other value into the field if a NULL value is encountered. For example, in Access it might be:

IIf(IsNull(Fld), vNullCaseValue, Fld)

Whereas in Oracle it might be:

NVL(Fld, vNullCaseValue)

In both cases you replace vNullCaseValue with the value to be displayed (or further processing to be performed) if NULL is returned.

Simple Case

This button inserts code that deals with a simple either/or condition. Different SQL data sources handle this differently. For example, in Access it might be:

IIf(Fld = when_expression_1, result_expression_1, else_expression)

Whereas in Oracle it might be:

DECODE(Fld, when_expression_1, result_expression_1, else_expression)

Working in the Relations Pane

Once you have connected to a data source, your connection is displayed in the upper-right pane. You can set up and save as many data source connections as you need. Integration Querybuilder stores all connections you create unless you explicitly delete them.

Creating Queries on the Relations Pane

To create new queries on the Relations pane:

  1. Double-click the connection you want to use. The DB Browser in the lower-right pane displays the database.
  2. Click the database icon to display the icons for tables, views and procedures for this database.
  3. Select New Query from the Query menu. A new query icon opens beneath the connection icon in the upper-right pane.
  4. Drag the tables and views you want to use into the Relations (upper left) pane. SELECT… FROM statements are created in the SQL pane. If tables are already linked in the database, these links are displayed, although these can be changed or removed for the purpose of this particular query.

    If you are using a table more than once, the second and further copies are renamed. For example, if you already have a Customer table in the Relations pane and you drag across another copy, it is automatically renamed Customer1.

    NOTE: Although views can be used for creating relationships if supported by the database you are using, for ease of reading, this section only refer to tables.

Creating a Join in the Relations Pane

When two tables each contain a column that shares a common set of values, they can be linked on this column by a join. For example, a table of orders and a table of customers could be linked if both contain a Customer ID column.

To create a join between two tables in the Relations pane:

  1. Click and drag the common column to the common column in the other table. A join line now connects the two columns and a JOIN statement inserted into the FROM clause in the SQL pane.

    If the fields you have connected were set as primary and foreign keys in the original database, "key" and "infinity" icons appear at either end of the join. If these icons do not appear it may be that the two columns do not contain common values.

    The default join between two tables is called an Inner Join. This means that the query only returns those records where the fields on both sides of the join match. Unmatched records are discarded.
  2. To change the join type, right-click the diamond icon in the middle of the join.

You can change the join to any of the following:

Available Joins and Their Descriptions

Left Outer Join
All records in the table on the left side of the join are returned, even if there is no matching record on the right side. To create, right-click the diamond icon and click the first Select All Rows From option.
Right Outer Join
All records in the table on the right side of the join are returned, even if there is no matching record on the left side. To create, right-click the diamond icon and click the second Select All Rows From… option.
Full Outer Join
If full joins are permitted by the database connection, then all records from both tables are returned, regardless of matching records in the other table. To select this, open the Join Line Properties dialog box by right-clicking the diamond icon and selecting Properties. Check both Include Rows boxes.

You can also create joins that are based on other logical conditions than the two selected columns being equal. Do this by right-clicking the diamond and selecting Properties. This is not recommended unless there is no other way to create a query and you know exactly what you are doing, as it can produce an extremely large number of records.

Selecting a Field in the Columns Pane

The Columns pane allows you to select the fields you want to use in your query. If field names are more than one word long they are displayed in the column cells surrounded by square brackets.

City
[Customer ID]

In the column list, fields are listed by the table they appear in and names are not bracketed.

Customers.City
Orders.Customer ID

An asterisk (*) is used to indicate all fields, so Employees.* means all fields in the Employees table.

To select a field in the Columns pane:

  1. Click in the first empty cell beneath Column. Click the arrow to display the list of fields.
  2. Scroll to the field you want to use and click it. The field name appears in the cell and in the SELECT statement in the SQL pane.
  3. If you want to use a different name for this field when the query is generated, type the new name in the Alias cell.
  4. Select or clear the Output check box.
  5. If you want to sort the records by this field, click in the Sort Type cell and select either Ascending or Descending.
  6. If you are sorting on more than one field, enter the sort order ranking for this field in the Sort Order cell. For example, if you were sorting by five fields, you would select "1" if this were the first field to sort on or "5" if it were the last.
  7. Select the Group By check box to group identical occurrences of this field.
  8. Specify the criterion, if any, to be applied to the Criteria cell. For example, if you had a Total Cost field and only wanted to return items where the total cost was more than $30.00 you would enter ">30.00" here. The code generated by your criteria will be displayed in the SQL pane as a WHERE clause.

The value you use in a Criteria cell can be preceded by one of the following operators: "=","<", ">", "<="," >=", "!<", "!>", "!=", "<>", "IS", "IS NOT", "NOT", "IN", "LIKE", "BETWEEN". If no operator is entered, the default is "=".

Values following the LIKE operator will be put into quotes unless already so. The LIKE operator enables you to use regular expressions.

The wildcards and syntax available for use in regular expressions will depend upon your source database, so you should refer to the documentation supplied for further information. They should not be confused with those used in RIFL regular expressions.

Each time you define a criterion, another criterion cell opens, enabling you to set up a string of OR conditions for this column. For example, to select records where the Employee ID is less than 10000, between 10500 and 10750 or greater than 20000:

Which will result in the following SQL clause:

WHERE
([Employee ID] < 10000 OR
[Employee ID] BETWEEN 10500 AND 10750 OR
[Employee ID] > 20000)

OR criteria are created horizontally, whereas AND criteria vertically. For example, to select records where the Employee ID is greater than 10000 and the Last Name is not Smith:

Which will result in the following SQL clause:

WHERE
([Employee ID] > 10000 AND
[Last Name] <> "Smith")

To set up an OR criteria that tests two separate fields, do the following:

Which will result in the following SQL clause:

WHERE
([Balance] > 400 OR
[Last Payment] > 100)

To combine OR and AND criteria, remember that criteria are grouped with the criteria in the same column. As in the following example:

WHERE
(Employee ID] > 10000 AND
[Last Name] = "Smith" OR
[Last Name] ="Jones")

[Employee ID] > 10000 is only used with [Last Name] = "Smith". Records will be returned when the [Last Name] ="Jones" regardless of employee ID.

To return all records where the customer ID is greater than 20000 and the company name is Smith, Jones or Featherstone you will need to set up the following criteria:

WHERE
([Employee ID] > 20000 AND
[Last Name] = "Smith" OR
[Employee ID] > 20000 AND
[Last Name] = "Jones" OR
[Employee ID] > 20000 AND
[Last Name] = "Featherstone")


Working in the SQL Pane

Integration Querybuilder's SQL pane displays the SQL statement as you create it using the Relations and Columns panes. However, it can also be used by more experienced users as a SQL editor to create queries that cannot be put together simply using the Relations and Columns panes.

Caution: If you are not familiar with writing SQL by hand, do not edit the code Querybuilder generates, as changes made in the SQL pane override the settings you created in the Relations and Columns panes, and you might have to begin the query again.

Incorporating SQL Functions

To use SQL functions:

  1. In the SQL Pane, click the cursor at the appropriate point in the SQL statement where you want the new function to be inserted. You can create new lined, press the TAB key and insert blank spaces as with any text editor to improve the readability of the code.
  2. In the Column pane, select the required field or fields. Hold down the CTRL key and click each field if you are using a function that concatenates multiple fields.
  3. Select the required function from the SQL functions menu.
  4. Caution: If you have not selected a field or fields in the Column pane, the field or fields highlighted in the Relations pane are used even if you have not selected this field or fields. If you select any code in the SQL Pane prior to selecting a function, this code is overwritten.
  5. If you have put a new statement after an existing statement in the SELECT section, insert a comma after the existing code. Because of the way Querybuilder validates code, you have to add the comma after you have created the function.
  6. The selected function, with the field name(s) inserted, is created at the cursor point.
  7. Test the validity of your SQL query by clicking the Check SQL Syntax icon.
    If the syntax is valid, a new line containing this function is created in the Columns pane. Enter a short, meaningful name for it in the Alias cell. If the syntax is invalid an error message displays. The Relations and Columns pane is grayed-out until the SQL code is corrected and validated again.

Defining SQL Manually

To manually define SQL:

If you are an experienced SQL user, type query statements directly into Integration Querybuilder using this pane as an editor and the Check SQL Syntax button to validate the code as you build it. You can also paste text into the SQL pane from external programs, online help and e-mails.

If you copy existing material into the SQL pane, you must remove any comments, as these will cause an error when the query is exported to Map Designer.

If you set up a join conditions in the SQL pane rather than in the Relations pane and the join condition has an expression instead of a field on one side, the join line (in the Relations pane) will is drawn from a field of one object to the title bar of the other. The diamond icon will display "f*". If both sides of the join condition are expressions rather than fields the join line is drawn between the title bars of both objects.

Querybuilder does not validate or run any query containing DJX. Test and validate your statement using dummy information and substitute the DJX code in Map Designer's Query Statement box afterwards.

To run the query:

  1. Click the Execute Query button.
    The Result tab displays showing the result of this query. If the query takes a noticeable time to execute, the Stop button becomes active so that the query can be canceled if required.
  2. Toggle between the Query and Result tabs of Integration Querybuilder if you need to fine-tune the query and check the result after each adjustment.
  3. Once you are satisfied with the query you have created, click OK (located at the bottom of the window). Save the query with a meaningful name.
  4. Integration Querybuilder closes and the generated SQL code is copied into the SQL Statement box. To see the code formatted, click the ellipsis to display it in a separate window.