Extension:External Data/Parsing data

The data retrieved from web pages, local files, local programs and inline text is usually expected to be in a structured format. The following parameters can be used in any of these calls to help parse that data. Some apply only to specific formats, while others are valid across all formats.

Cross-format parameters

  • |format= - specifies the format of the data being retrieved: it should be one of 'CSV', 'CSV with header', 'GFF', 'JSON', 'JSON with JSONpath', 'YAML', 'YAML with JSONpath', 'XML', 'XML with XPath', 'HTML', 'HTML with XPath', 'ini' or 'text'. CSV, JSON, YAML, and XML are standard data formats; GFF, or the Generic Feature Format, is a format for genomic data. The difference between 'CSV' and 'CSV with header' is that 'CSV' is simply a set of lines with values, while in 'CSV with header' the first line is a "header" holding a comma-separated list of column names. 'text' indicates that the contents of the file should be retrieved as-is. The 'ini' format refers to key-value pairs (see below).
    If |format=auto is set, the extension will attempt (usually successfully) to determine the actual data format, based on file or URL extension, some other arguments to the parser function and the parsed content itself.
  • |start line=, |end line=, |header lines=, |footer lines= - use these to cut out a fragment of the data. Line numbers are one-based; negative values (-1 meaning the last line) and percentages (0% to 100%) are also accepted. Use |header lines= and |footer lines= to carve out a valid CSV, JSON or XML fragment. Note that if any of these is set, additional newlines will be injected into XML or JSON to guarantee that the required tag/variable blocks begin and end on new lines, which will affect the required |start line= and |end line= settings. The external variables __start and __end store the beginning and end of the main fragment (without header or footer), __lines contains the number of lines returned, and __total contains the total number of lines in the file.
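
The line-addressing rules for |start line= and |end line= can be illustrated with a minimal Python sketch. This is not the extension's own (PHP) code: the function names resolve_line and cut_fragment are hypothetical, and the way a percentage is rounded to a line number is an assumption made only for this example. Header and footer lines are left out to keep it short.

```python
import math


def resolve_line(spec: str, total: int) -> int:
    """Turn a line specification into a one-based line number.

    Assumed rules (for illustration only):
      "5"   -> line 5 (one-based)
      "-1"  -> the last line, "-2" -> the line before it
      "50%" -> that share of the file, rounded up (the rounding is an assumption)
    The result is clamped to the range 1..total.
    """
    spec = spec.strip()
    if spec.endswith("%"):
        fraction = float(spec[:-1]) / 100.0
        return max(1, min(total, math.ceil(fraction * total)))
    number = int(spec)
    if number < 0:
        # -1 is the last line, so count back from total + 1.
        number = total + number + 1
    return max(1, min(total, number))


def cut_fragment(text: str, start: str, end: str) -> dict:
    """Cut lines start..end (inclusive) and report the bookkeeping values."""
    lines = text.splitlines()
    total = len(lines)
    first = resolve_line(start, total)
    last = resolve_line(end, total)
    fragment = lines[first - 1:last]
    return {
        "__start": first,
        "__end": last,
        "__lines": len(fragment),
        "__total": total,
        "fragment": fragment,
    }
```

For a ten-line file, cut_fragment(text, "2", "-2") keeps lines 2 to 9 under these assumptions.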

Format-specific parameters


CSV and INI

  • |delimiter= -
    • in CSV format, specifies the delimiter between values in the data set. The default value is ",". To specify a tab delimiter, use "\t".
    • in INI format, specifies the delimiter between key and value; by default, =.
If |delimiter=auto, |delimiter=detect or |delimiter=autodetect is set, or |delimiter= is not set at all, the extension will automatically choose a delimiter, trying ;, ,, tab and | for CSV, and = and : for INI. The delimiter that yields the most even and widest table will be chosen.
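
One way to picture "most even and widest" is the following Python sketch. It is a guess at the idea, not the extension's actual algorithm: the scoring (share of lines that have the most common column count, then that column count) and the name guess_delimiter are assumptions, and quoting is ignored.

```python
from collections import Counter

# Candidate delimiters for CSV, in the order listed above.
CSV_CANDIDATES = [";", ",", "\t", "|"]


def guess_delimiter(text, candidates=CSV_CANDIDATES):
    """Pick the candidate that splits the lines into the most regular table.

    Score = (evenness, width), where evenness is the share of lines with the
    most common column count and width is that column count. Both the score
    and the tie-breaking order are assumptions made for this sketch.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return None
    best, best_score = None, None
    for delim in candidates:
        widths = [len(line.split(delim)) for line in lines]
        width, hits = Counter(widths).most_common(1)[0]
        if width < 2:
            continue  # this delimiter does not split a typical line at all
        score = (hits / len(lines), width)
        if best_score is None or score > best_score:
            best, best_score = delim, score
    return best
```

A naive split like this would be misled by delimiters inside quoted values, so a real detector would need to account for quoting.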

Only CSV

  • |with header or |header=yes indicates that the CSV file has a header line. That line will not be parsed as data; instead, the column headers will be used as variable names in place of column numbers. If |header=auto, |header=detect or |header=autodetect is set, the extension will try to determine whether the first line is likely to contain column headers by analysing its contents, comparing it with the second line, and taking into account the external variable names (all numeric or special, or not) given in the |data= parameter.
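
The effect of the header setting on variable names can be sketched in Python with the standard csv module. The helper csv_columns is hypothetical, and numbering headerless columns from 1 is an assumption of this example rather than something stated above.

```python
import csv
import io


def csv_columns(text, header=False, delimiter=","):
    """Return a mapping of variable name -> list of column values.

    With header=True the first line supplies the names and is not treated as
    data; otherwise columns are named "1", "2", ... (assumed numbering).
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return {}
    if header:
        names, rows = rows[0], rows[1:]
    else:
        names = [str(i) for i in range(1, len(rows[0]) + 1)]
    return {
        name: [row[i] if i < len(row) else "" for row in rows]
        for i, name in enumerate(names)
    }
```

With the same input, header=True yields the variables fruit and color, while header=False yields 1 and 2 and keeps the first line as data.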

Only text

  • |regex= - specifies a PHP (Perl-compatible) regular expression that should be used to extract specific strings; used with the "text" format. Example: for the sample text <h1>Heading</h1>, the regex |regex=/<h1>(?'matched'.*)<\/h1>/ assigns "Heading" to the external variable matched.
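
The named-group mechanism in that example can be tried out in Python, as a rough analogue only. Python's re module spells a named group (?P<matched>...) where the PCRE example above uses (?'matched'...); the helper name extract is hypothetical.

```python
import re

# Same pattern as the example above, in Python's named-group syntax.
PATTERN = re.compile(r"<h1>(?P<matched>.*)</h1>")


def extract(text):
    """Return the named groups of the first match, or an empty dict."""
    match = PATTERN.search(text)
    return match.groupdict() if match else {}
```

Each named group plays the role of one external variable: here the only group is matched.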

XML and HTML

  • |use xpath - an optional parameter that can be used with the "XML" or "HTML" formats, to indicate that "data" mappings should be done using XPath notation. This is especially useful if the same tag or attribute name is used more than once in the file, and you only want to get a specific instance of it. The details of XPath notation are not covered on this page.
    • |default xmlns prefix= - an optional parameter that can be used with "use xpath", which sets the default namespace prefix to be used.

JSON and YAML

  • |use jsonpath - an optional parameter that can be used with the "JSON" and "YAML" formats, to indicate that "data" mappings should be done using JSONPath notation. JSONPath is less well known than XPath, but guides to its syntax and online JSONPath evaluators are available.

Only JSON

  • |json offset= - an optional parameter that represents the number of characters to ignore at the beginning of the data set being parsed. It is used with JSON values, in case the JSON being accessed has some kind of security string at the beginning.
  • |allow trailing commas= - if this is set, JSON files with commas before ] or } will be parsed even though the JSON specification does not allow trailing commas. This setting is useful when |start line=, |end line=, |header lines= and |footer lines= are set, as a fragment cut out of a larger structure may end with a comma.
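
Both JSON-only settings can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the extension's code: the function parse_json and its argument names are hypothetical, and the trailing-comma removal is a naive regular expression that would also alter a ",]" or ",}" sequence inside a string value.

```python
import json
import re


def parse_json(text, offset=0, allow_trailing_commas=False):
    """Parse JSON after skipping `offset` characters at the start.

    offset: number of leading characters to ignore (for example a security
            prefix placed before the JSON body).
    allow_trailing_commas: drop commas that directly precede ] or }.
    """
    body = text[offset:]
    if allow_trailing_commas:
        # Naive approach: does not check whether the comma is inside a string.
        body = re.sub(r",\s*([\]}])", r"\1", body)
    return json.loads(body)
```

For a response that starts with the five characters )]}' plus a newline, an offset of 5 leaves plain JSON for the parser.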

Only INI

  • |comment delimiter= - for the "ini" format, contains the character or string used at the beginning of comment lines (by default, # and ;).
  • |invalid as comments= - for the "ini" format, determines whether a line that cannot be parsed as a configuration setting or a comment should be treated as a comment (some INI formats may have section headings like [Section]).

Parsing XML and HTML


For data from XML and HTML sources, the variable names are determined by both tag and attribute names. For example, given the following XML text:

<fruit type="Apple"><color>red</color></fruit>

the variable type would have the value Apple, and the variable color would have the value red.

Similarly, the following XML text would be interpreted as a table of values defining two variables named type and color:

<fruits>
  <fruit type="Apple"><color>red</color></fruit>
  <fruit type="Kiwi"><color>brown</color></fruit>
</fruits>

For more complex XML structures, it may make sense to use XPath to retrieve values; see "use xpath", above.
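
The mapping from tag and attribute names to variables, and the kind of selection XPath adds, can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration: the helper names are hypothetical, and ElementTree supports just a subset of XPath, so it is not equivalent to what "use xpath" accepts.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<fruits>
  <fruit type="Apple"><color>red</color></fruit>
  <fruit type="Kiwi"><color>brown</color></fruit>
</fruits>"""


def xml_to_columns(xml_text):
    """Collect attribute values and element text under their names."""
    columns = {}
    for element in ET.fromstring(xml_text).iter():
        # Every attribute becomes a variable named after the attribute.
        for name, value in element.attrib.items():
            columns.setdefault(name, []).append(value)
        # Every element with non-blank text becomes a variable named after the tag.
        text = (element.text or "").strip()
        if text:
            columns.setdefault(element.tag, []).append(text)
    return columns


def xpath_values(xml_text, path):
    """Return the text of all nodes matched by a (limited) XPath expression."""
    return [node.text for node in ET.fromstring(xml_text).findall(path)]
```

Here xml_to_columns(XML_TEXT) gives two-row columns for type and color, while xpath_values(XML_TEXT, ".//fruit[@type='Kiwi']/color") picks out only the Kiwi's color.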

Using CSS-style selectors


With the "HTML" format, you can either use XPath (see above) or CSS-style selectors. For CSS-style selection, you do not need to specify a special parameter: it is the default approach used when "use xpath" is not specified. CSS selectors are a notation that uses tag names, classes and IDs to locate one or more elements in an HTML page; the same syntax is used in jQuery.

INI text example


The "ini" format refers to INI files; here is a short example of such a file:

error_reporting = E_ALL | E_ERROR | E_WARNING | E_PARSE | E_CORE_ERROR | E_CORE_WARNING | E_COMPILE_ERROR | E_COMPILE_WARNING | E_USER_ERROR | E_USER_WARNING | E_USER_NOTICE; 
display_errors = On;
display_startup_errors = On;
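
How the INI-related settings (delimiter, comment delimiter, invalid lines as comments) interact can be sketched in Python. This is an illustration only, not the extension's parser: the function parse_ini and its argument names are hypothetical, and values are kept verbatim, so a trailing semicolon such as the ones in the sample above stays part of the value.

```python
def parse_ini(text, delimiter="=", comment_delimiters=("#", ";"),
              invalid_as_comments=True):
    """Parse key-value lines into a dict.

    Lines starting with a comment delimiter are skipped. A line without the
    key/value delimiter (for example a [Section] heading) is skipped when
    invalid_as_comments is true, and raises ValueError otherwise.
    """
    values = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(tuple(comment_delimiters)):
            continue
        key, found, value = stripped.partition(delimiter)
        if not found or not key.strip():
            if invalid_as_comments:
                continue
            raise ValueError(f"Cannot parse line: {line!r}")
        values[key.strip()] = value.strip()
    return values
```

Splitting on the first delimiter only means that a value may itself contain = or | characters, as the error_reporting line does.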