Fit Specification: Parsing and Output

 

This portion of the Fit specification describes how Fit parses and outputs HTML documents.

 

Documents and Tables

Basic Parsing

 

Fit parses the tables from HTML documents.  For example, in the following table, raw HTML is shown on the left and Fit’s view of the HTML is shown on the right.  (Table cells are in brackets, each row is on its own line, and tables are separated by dashes.)

 

fat.ParseFixture

 

Html

Parse()

<table>

  <tr><td>1</td></tr>

</table>

[1]

<table>

  <tr><td>1</td>   <td>2</td></tr>

  <tr><td>3</td>   <td>4</td></tr>

</table>

[1] [2]

[3] [4]

<table>

  <tr><td>1</td>   <td>2</td></tr>

  <tr><td>3</td>   <td>4</td></tr>

</table>

<table>

  <tr><td>5</td></tr>

  <tr><td>6</td></tr>

</table>

[1] [2]

[3] [4]

----

[5]

[6]
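
The parsing shown above can be approximated with a short Python sketch. It is not Fit's actual parser: the names parse_tables and show are hypothetical, and the regular expressions assume well-formed, non-nested tables.

```python
import re

# Hypothetical patterns: each matches one element and captures its body.
TABLE = re.compile(r"<table\b[^>]*>(.*?)</table\s*>", re.IGNORECASE | re.DOTALL)
ROW = re.compile(r"<tr\b[^>]*>(.*?)</tr\s*>", re.IGNORECASE | re.DOTALL)
CELL = re.compile(r"<td\b[^>]*>(.*?)</td\s*>", re.IGNORECASE | re.DOTALL)

def parse_tables(html):
    """Return a list of tables; each table is a list of rows of cell strings."""
    return [
        [[cell.strip() for cell in CELL.findall(row)] for row in ROW.findall(table)]
        for table in TABLE.findall(html)
    ]

def show(tables):
    """Render tables in the bracket notation used in the examples."""
    rendered = [
        "\n".join(" ".join(f"[{cell}]" for cell in row) for row in table)
        for table in tables
    ]
    return "\n----\n".join(rendered)
```

Calling show(parse_tables(html)) on the two-table example yields the bracketed rows of the first table, a line of dashes, and then the rows of the second table.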

 

Other HTML

 

Everything but table structure and cell contents is ignored.  The ignored portions are preserved so they can be output again later, exactly as they were read in.

 

fat.ParseFixture

 

 

Html

Parse()

Output()

<html>

<body>Text before table...

<table>

  <tr><td>1</td></tr>

</table>

Text after table...</body>

</html>

[1]

<html>

<body>Text before table...

<table>

  <tr><td>1</td></tr>

</table>

Text after table...</body>

</html>

<table>

Text in table

<tr>

  Text in row

  <td>Text in cell</td>

  more row

</tr>

more table</table>

[Text in cell]

<table>

Text in table

<tr>

  Text in row

  <td>Text in cell</td>

  more row

</tr>

more table</table>

<table cellpadding="3">

  <tr attribute="yes"><td align="top">Cell</td></tr>

</table>

[Cell]

<table cellpadding="3">

  <tr attribute="yes"><td align="top">Cell</td></tr>

</table>

 

Even whitespace is preserved.

 

fat.ParseFixture

 

 

Html

Parse()

Output()

<html><body><table>

  <tr><td>1</td></tr>

</table></body></html>

[1]

<html><body><table>

  <tr><td>1</td></tr>

</table></body></html>

<html>

  <body>

    <table>

      <tr>

        <td>1</td>

      </tr>

    </table>

  </body>

</html>

[1]

<html>

  <body>

    <table>

      <tr>

        <td>1</td>

      </tr>

    </table>

  </body>

</html>
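
One way to get this exact round trip is to keep every piece of the input, including the ignored text between cells, and to join the pieces again on output. The Python sketch below illustrates the idea only; split_document and output are hypothetical names, and nothing here is taken from Fit's implementation.

```python
import re

# Splitting on a capturing group keeps the matched cells in the result, so
# the pieces alternate: ignored text, cell, ignored text, cell, ...
CELL = re.compile(r"(<td\b[^>]*>.*?</td\s*>)", re.IGNORECASE | re.DOTALL)

def split_document(html):
    """Split a document into cells and the untouched text around them."""
    return CELL.split(html)

def output(pieces):
    """Rebuild the document; unchanged pieces reproduce the input exactly."""
    return "".join(pieces)
```

Because no character of the input is discarded, output(split_document(html)) equals html for any input, whatever its indentation or line breaks.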

 

Complicated Tables

 

The colspan and rowspan attributes of table cells are also ignored, but jagged tables (tables with a varying number of cells in each row) are accepted:

fat.ParseFixture

 

Html

Parse()

<table>

  <tr><td>1</td></tr>

  <tr><td>2</td>   <td>3</td>   <td>4</td></tr>

  <tr><td>5</td>   <td>6</td></tr>

</table>

[1]

[2] [3] [4]

[5] [6]

<table>

  <tr><td rowspan=2>1</td>   <td>2</td>   <td>3</td></tr>

  <tr><td colspan=2>4</td>   <td>5</td></tr>

</table>

[1] [2] [3]

[4] [5]

 

Malformed HTML

 

Tables that are missing tags generate an error.

 

fat.ParseFixture

 

 

Html

Parse()

Note

<table>

  <tr><td>1</td>

</table>

error

no ending </tr> tag

<tr><td>1</td></tr>

error

no <table> tag

<table>

  <td>1</td>

</table>

error

no <tr> tag

<table>

  <tr><td>1</tr>

</table>

error

no ending </td> tag
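
The four error cases above can be reproduced by descending from table to row to cell and failing at the first tag that cannot be found. This Python sketch assumes that approach; TableParseError, element_body and first_cell are hypothetical names, and the error messages simply mirror the notes in the table.

```python
import re

class TableParseError(ValueError):
    """Hypothetical error type for a table that is missing a tag."""

def element_body(text, tag):
    """Return the text between the first <tag ...> and the next </tag>."""
    opening = re.search(rf"<{tag}\b[^>]*>", text, re.IGNORECASE)
    if opening is None:
        raise TableParseError(f"no <{tag}> tag")
    closing = re.compile(rf"</{tag}\s*>", re.IGNORECASE).search(text, opening.end())
    if closing is None:
        raise TableParseError(f"no ending </{tag}> tag")
    return text[opening.end():closing.start()]

def first_cell(html):
    """Descend table -> row -> cell, failing at the first missing tag."""
    table = element_body(html, "table")
    row = element_body(table, "tr")
    return element_body(row, "td")
```

Searching for each closing tag only inside its parent's body is what lets a missing </tr> be reported even though a </table> follows it.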

 

Tables containing excess tags do not generate an error; the excess tags are ignored.

 

fat.ParseFixture

 

Html

Parse()

<table>

  <tr><td>1</td></tr>

  <table>

  <tr><td>2</td></tr>

</table>

[1]

[2]

 

HTML mistakes that aren’t related to tables are ignored.

 

fat.ParseFixture

 

Html

Parse()

<table>

  <tr><badTag...<td>1</td></tr>

</table>

[1]

 

Cells

 

When Fit parses a table, it converts the contents of each cell into a string.

 

Whitespace

 

Leading and trailing whitespace are removed.  The &nbsp; entity and non-breaking space character (represented here with “\u00a0”) are considered whitespace when removing leading and trailing whitespace.  Tags other than line-break tags are ignored.

 

fat.ParseFixture

 

TableCell

Parse()

<td>     a    </td>     

[a]

<td>

 

   a  

 

</td>

[a]

<td>a&nbsp;      </td>

[a]

<td>  a &nbsp;</td>

[a]

<td>\u00a0 a \u00a0</td>

[a]

<td>    <tag />&nbsp;a</td>

[a]

 

Adjoining whitespace is combined into a single space.  The &nbsp; entity and non-breaking space character are not considered whitespace when combining whitespace.  Tags other than line-break tags are ignored.

 

fat.ParseFixture

 

TableCell

Parse()

<td>1   +

 

2</td>

[1 + 2]

<td>1   <tag />    2</td>

[1 2]

<td>1 &nbsp;&nbsp;&nbsp;2</td>

[1    2]

<td>1 \u00a0\u00a0\u00a02</td>

[1    2]
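
The two whitespace rules interact, so the order of the steps matters. A minimal Python sketch, assuming the order shown in the comments (the function name cell_text is hypothetical and line-break tags are not handled here):

```python
import re

NBSP = "\u00a0"
PLAIN_WHITESPACE = " \t\r\n\f"

def cell_text(inner_html):
    """Normalize the whitespace of a cell body, given without its <td> tags."""
    text = re.sub(r"<[^>]*>", "", inner_html)    # 1. ignore tags
    text = text.replace("&nbsp;", NBSP)          # 2. treat the entity like the character
    text = re.sub(r"[ \t\r\n\f]+", " ", text)    # 3. combine runs; NBSP is not in the class
    text = text.strip(PLAIN_WHITESPACE + NBSP)   # 4. trim; here NBSP does count
    return text.replace(NBSP, " ")               # 5. remaining NBSPs become plain spaces
```

The explicit character class in step 3 is deliberate: in Python, \s also matches the non-breaking space, which would collapse the runs of &nbsp; that the examples preserve.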

 

Character Conversion

 

These specific HTML entities are converted into characters:

 

fat.ParseFixture

 

Entity

Parse()

&amp;

[&]

(&nbsp;)

[( )]

&lt;

[<]

&gt;

[>]

&quot;

["]

 

The non-breaking space character is converted into a normal space.

 

fat.ParseFixture

 

Entity

Parse()

(\u00a0)

[( )]

 

Extended characters are preserved as-is.

 

fat.ParseFixture

 

Entity

Parse()

ñ

[ñ]
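
The character conversions above amount to a small fixed replacement table. The Python sketch below is one possible reading, with a hypothetical function name; it replaces &amp; last so that text such as &amp;lt; decodes to the literal characters &lt; rather than to <.

```python
# The five entities listed above; &amp; is deliberately last.
ENTITIES = [
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&nbsp;", " "),
    ("&quot;", '"'),
    ("&amp;", "&"),
]

def convert_characters(text):
    """Convert the listed entities and the non-breaking space character."""
    for entity, character in ENTITIES:
        text = text.replace(entity, character)
    # Anything else, including extended characters, passes through unchanged.
    return text.replace("\u00a0", " ")
```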

 

Line-break tags are converted into ASCII 10 line feed characters (shown here as “\n”).

 

fat.ParseFixture

 

TableCell

Parse()

<td>intentional<br>line-break</td>

[intentional\nline-break]

<td>another form<br />of line-break</td>

[another form\nof line-break]

<td>yet<br/>more<br />forms<  br   /   ></td>

[yet\nmore\nforms\n]
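
All of the line-break forms above fit a single tolerant pattern. The following Python sketch is an illustration under that assumption; convert_line_breaks is a hypothetical name.

```python
import re

# Optional whitespace after "<", around the optional "/", and before ">".
LINE_BREAK = re.compile(r"<\s*br\s*/?\s*>", re.IGNORECASE)

def convert_line_breaks(inner_html):
    """Replace every form of line-break tag with a line feed (ASCII 10)."""
    return LINE_BREAK.sub("\n", inner_html)
```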

 

Microsoft Word

 

Fit has a few special parsing rules for HTML created by Microsoft Word.

 

“Smart quotes” are converted to regular quotes.

 

fat.ParseFixture

 

TableCell

Parse()

<td>“double-quotes” </td>

["double-quotes"]

<td>‘single quotes’</td>

['single quotes']

 

Word’s use of paragraph tags for line breaks is supported.

 

fat.ParseFixture

 

TableCell

Parse()

<td><p>Line breaks</p> <p>in Word</p></td>

[Line breaks\nin Word]

<td><p>Alternative line<   /   p   ><  p >breaks</td>

[Alternative line\nbreaks]
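
Both Word rules can be sketched in a few lines of Python. This is an illustration only: word_cell_text is a hypothetical name, and treating a closing paragraph tag followed by an opening one as a single line break is an assumption drawn from the two examples above.

```python
import re

# Curly quotes mapped to their straight equivalents.
SMART_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})

# A closing </p>, optional whitespace, then an opening <p>; spacing inside
# the tags is tolerated.
PARAGRAPH_BREAK = re.compile(r"<\s*/\s*p\s*>\s*<\s*p\b[^>]*>", re.IGNORECASE)

def word_cell_text(inner_html):
    """Apply the Word-specific rules to a cell body, given without its <td> tags."""
    text = PARAGRAPH_BREAK.sub("\n", inner_html)   # paragraph boundary -> line feed
    text = re.sub(r"<[^>]*>", "", text)            # ignore the remaining tags
    return text.translate(SMART_QUOTES).strip()
```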

 

Other HTML

 

Other HTML markup is ignored.

 

fat.ParseFixture

 

TableCell

Parse()

<td><b>text</b></td>

[text]

<td>

  a more <i>complicated

  <spell check="true">example</spell></i></td>

[a more complicated example]

 

Cell Output

 

When the contents of a table cell are changed, Fit does some of the above conversions in reverse.

 

Some characters are turned into entities.

 

fat.OutputFixture

 

Text

CellOutput()

<

<td>&lt;</td>

&

<td>&amp;</td>
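
A minimal Python sketch of this escaping, with hypothetical function names. Only the two characters shown above are handled, and the ampersand is replaced first so that the & produced for &lt; is not escaped a second time.

```python
def escape_text(text):
    """Turn '&' and '<' into entities; '&' must be handled first."""
    return text.replace("&", "&amp;").replace("<", "&lt;")

def cell_output(text):
    """Wrap the escaped text in a table cell."""
    return f"<td>{escape_text(text)}</td>"
```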

 

ASCII line feed codes are turned into HTML break tags.  (In these examples, “\n” is ASCII 10, “new line”, and “\r” is ASCII 13, “carriage return.”)

 

fat.OutputFixture

 

Text

CellOutput()

Unix \n line feed

<td>Unix <br /> line feed</td>

Mac \r line feed

<td>Mac <br /> line feed</td>

DOS \r\n line feed

<td>DOS <br /> line feed</td>

Backwards \n\r line feed

<td>Backwards <br /> line feed</td>
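
The four line-ending conventions above can be handled in one pass if the two-character sequences are tried before the single characters. A Python sketch under that assumption (line_feeds_to_breaks is a hypothetical name):

```python
import re

# Two-character endings come first so that "\r\n" and "\n\r" each produce
# one break tag rather than two.
LINE_ENDING = re.compile(r"\r\n|\n\r|\r|\n")

def line_feeds_to_breaks(text):
    """Replace each line ending with an HTML break tag."""
    return LINE_ENDING.sub("<br />", text)
```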

 

Runs of adjoining spaces are preserved by turning every second space into a &nbsp; entity.

 

fat.OutputFixture

 

Text

CellOutput()

1     2

<td>1 &nbsp; &nbsp; 2</td>
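
The example above is consistent with replacing each pair of spaces, scanning left to right, with a space followed by &nbsp;. The Python sketch below assumes that rule; the specification shows only the five-space case, so the behavior for other run lengths is an assumption, and preserve_spaces is a hypothetical name.

```python
def preserve_spaces(text):
    """Replace each non-overlapping pair of spaces with ' &nbsp;'."""
    return text.replace("  ", " &nbsp;")

def cell_output(text):
    """Wrap the converted text in a table cell."""
    return f"<td>{preserve_spaces(text)}</td>"
```

A single space is left alone, so ordinary text between words is unaffected.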

 

Errata

 

Known errors and omissions (fix me!):

 

 

Possible Improvements

 

 


 

 

 

 

fit.Summary

counts 54 right, 0 wrong, 0 ignored, 0 exceptions
input file C:\projects\fit\imp\java\..\..\spec\parse.html
input update Wed Jul 21 00:30:07 PDT 2004
output file C:\projects\fit\imp\java\output\spec\parse.html
run date Tue Aug 31 16:56:41 PDT 2004
run elapsed time 0:00.22