|  | .TH HTML 3 | 
|  | .SH NAME | 
|  | parsehtml, | 
|  | printitems, | 
|  | validitems, | 
|  | freeitems, | 
|  | freedocinfo, | 
|  | dimenkind, | 
|  | dimenspec, | 
|  | targetid, | 
|  | targetname, | 
|  | fromStr, | 
|  | toStr | 
|  | \- HTML parser | 
|  | .SH SYNOPSIS | 
|  | .nf | 
|  | .PP | 
|  | .ft L | 
|  | #include <u.h> | 
|  | #include <libc.h> | 
|  | #include <html.h> | 
|  | .ft P | 
|  | .PP | 
|  | .ta \w'\fLToken* 'u | 
|  | .B | 
|  | Item*	parsehtml(uchar* data, int datalen, Rune* src, int mtype, | 
|  | .B | 
|  | int chset, Docinfo** pdi) | 
|  | .PP | 
|  | .B | 
|  | void	printitems(Item* items, char* msg) | 
|  | .PP | 
|  | .B | 
|  | int	validitems(Item* items) | 
|  | .PP | 
|  | .B | 
|  | void	freeitems(Item* items) | 
|  | .PP | 
|  | .B | 
|  | void	freedocinfo(Docinfo* d) | 
|  | .PP | 
|  | .B | 
|  | int	dimenkind(Dimen d) | 
|  | .PP | 
|  | .B | 
|  | int	dimenspec(Dimen d) | 
|  | .PP | 
|  | .B | 
|  | int	targetid(Rune* s) | 
|  | .PP | 
|  | .B | 
|  | Rune*	targetname(int targid) | 
|  | .PP | 
|  | .B | 
|  | uchar*	fromStr(Rune* buf, int n, int chset) | 
|  | .PP | 
|  | .B | 
|  | Rune*	toStr(uchar* buf, int n, int chset) | 
|  | .SH DESCRIPTION | 
|  | .PP | 
|  | This library implements a parser for HTML 4.0 documents. | 
|  | The parsed HTML is converted into an intermediate representation that | 
|  | describes how the formatted HTML should be laid out. | 
|  | .PP | 
|  | .I Parsehtml | 
|  | parses an entire HTML document contained in the buffer | 
|  | .I data | 
|  | and having length | 
|  | .IR datalen . | 
|  | The URL of the document should be passed in as | 
|  | .IR src . | 
|  | .I Mtype | 
|  | is the media type of the document, which should be either | 
|  | .B TextHtml | 
|  | or | 
|  | .BR TextPlain . | 
|  | The character set of the document is described in | 
|  | .IR chset , | 
|  | which can be one of | 
|  | .BR US_Ascii , | 
|  | .BR ISO_8859_1 , | 
|  | .B UTF_8 | 
|  | or | 
|  | .BR Unicode . | 
|  | The return value is a linked list of | 
|  | .B Item | 
|  | structures, described in detail below. | 
|  | As a side effect, | 
|  | .BI * pdi | 
|  | is set to point to a newly created | 
|  | .B Docinfo | 
|  | structure, containing information pertaining to the entire document. | 
|  | .PP | 
|  | The library expects two allocation routines to be provided by the | 
|  | caller, | 
|  | .B emalloc | 
|  | and | 
|  | .BR erealloc . | 
|  | These routines are analogous to the standard malloc and realloc routines, | 
|  | except that they should not return if the memory allocation fails. | 
|  | In addition, | 
|  | .B emalloc | 
|  | is required to zero the memory. | 
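As an illustration only, the two allocators might be sketched in portable C as follows. The size_t parameter type, the error messages and the decision to exit on failure are assumptions made for this sketch; the prototypes actually expected by the library are the ones declared in html.h.

```c
#include <stdio.h>
#include <stdlib.h>

/* Allocate n zeroed bytes; does not return if allocation fails. */
void *
emalloc(size_t n)
{
	void *p;

	p = calloc(1, n ? n : 1);	/* calloc zeroes the block */
	if(p == NULL){
		fprintf(stderr, "emalloc: out of memory\n");
		exit(1);
	}
	return p;
}

/* Resize p to n bytes; does not return if allocation fails. */
/* Bytes added by growing the block are not zeroed. */
void *
erealloc(void *p, size_t n)
{
	void *q;

	q = realloc(p, n ? n : 1);
	if(q == NULL){
		fprintf(stderr, "erealloc: out of memory\n");
		exit(1);
	}
	return q;
}
```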
|  | .PP | 
|  | For debugging purposes, | 
|  | .I printitems | 
|  | may be called to display the contents of an item list; individual items may | 
|  | be printed using the | 
|  | .B %I | 
|  | print verb, installed on the first call to | 
|  | .IR parsehtml . | 
|  | .I Validitems | 
|  | traverses the item list, checking that all of the pointers are valid. | 
|  | It returns | 
|  | .B 1 | 
|  | if everything is ok, and | 
|  | .B 0 | 
|  | if an error was found. | 
|  | Normally, one would not call these routines directly. | 
|  | Instead, one sets the global variable | 
|  | .I dbgbuild | 
|  | and the library calls them automatically. | 
|  | One can also set | 
|  | .IR warn , | 
|  | to cause the library to print a warning whenever it finds a problem with the | 
|  | input document, and | 
|  | .IR dbglex , | 
|  | to print debugging information in the lexer. | 
|  | .PP | 
|  | When an item list is no longer needed, it should be freed with | 
|  | .IR freeitems . | 
|  | Then, | 
|  | .I freedocinfo | 
|  | should be called on the pointer returned in | 
|  | .BI * pdi\f1. | 
|  | .PP | 
|  | .I Dimenkind | 
|  | and | 
|  | .I dimenspec | 
|  | are provided to interpret the | 
|  | .B Dimen | 
|  | type, as described in the section | 
|  | .IR "Dimension Specifications" . | 
|  | .PP | 
|  | Frame target names are mapped to integer ids via a global, permanent mapping. | 
|  | To find the value for a given name, call | 
|  | .IR targetid , | 
|  | which allocates a new id if the name hasn't been seen before. | 
|  | The name of a given, known id may be retrieved using | 
|  | .IR targetname . | 
|  | The library predefines | 
|  | .BR FTtop , | 
|  | .BR FTself , | 
|  | .B FTparent | 
|  | and | 
|  | .BR FTblank . | 
|  | .PP | 
|  | The library handles all text as Unicode strings (type | 
|  | .BR Rune* ). | 
|  | Character set conversion is provided by | 
|  | .I fromStr | 
|  | and | 
|  | .IR toStr . | 
|  | .I FromStr | 
|  | takes | 
|  | .I n | 
|  | Unicode characters from | 
|  | .I buf | 
|  | and converts them to the character set described by | 
|  | .IR chset . | 
|  | .I ToStr | 
|  | takes | 
|  | .I n | 
|  | bytes from | 
|  | .IR buf , | 
|  | interpreted as belonging to character set | 
|  | .IR chset , | 
|  | and converts them to a Unicode string. | 
|  | Both routines null-terminate the result, and use | 
|  | .B emalloc | 
|  | to allocate space for it. | 
|  | .SS Items | 
|  | The return value of | 
|  | .I parsehtml | 
|  | is a linked list of variant structures, | 
|  | with the generic portion described by the following definition: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Genattr* 'u | 
|  | typedef struct Item Item; | 
|  | struct Item | 
|  | { | 
|  | Item*	next; | 
|  | int	width; | 
|  | int	height; | 
|  | int	ascent; | 
|  | int	anchorid; | 
|  | int	state; | 
|  | Genattr*	genattr; | 
|  | int	tag; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | The field | 
|  | .B next | 
|  | points to the successor in the linked list of items, while | 
|  | .BR width , | 
|  | .BR height , | 
|  | and | 
|  | .B ascent | 
|  | are intended for use by the caller as part of the layout process. | 
|  | .BR Anchorid , | 
|  | if non-zero, gives the integer id assigned by the parser to the anchor that | 
|  | this item is in (see section | 
|  | .IR Anchors ). | 
|  | .B State | 
|  | is a collection of flags and values described as follows: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'IFindentshift = 'u | 
|  | enum | 
|  | { | 
|  | IFbrk =	0x80000000, | 
|  | IFbrksp =	0x40000000, | 
|  | IFnobrk =	0x20000000, | 
|  | IFcleft =	0x10000000, | 
|  | IFcright =	0x08000000, | 
|  | IFwrap =	0x04000000, | 
|  | IFhang =	0x02000000, | 
|  | IFrjust =	0x01000000, | 
|  | IFcjust =	0x00800000, | 
|  | IFsmap =	0x00400000, | 
|  | IFindentshift =	8, | 
|  | IFindentmask =	(255<<IFindentshift), | 
|  | IFhangmask =	255 | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B IFbrk | 
|  | is set if a break is to be forced before placing this item. | 
|  | .B IFbrksp | 
|  | is set if a 1 line space should be added to the break (in which case | 
|  | .B IFbrk | 
|  | is also set). | 
|  | .B IFnobrk | 
|  | is set if a break is not permitted before the item. | 
|  | .B IFcleft | 
|  | is set if left floats should be cleared (that is, if the list of pending left floats should be placed) | 
|  | before this item is placed, and | 
|  | .B IFcright | 
|  | is set for right floats. | 
|  | In both cases, | 
|  | .B IFbrk | 
|  | is also set. | 
|  | .B IFwrap | 
|  | is set if the line containing this item is allowed to wrap. | 
|  | .B IFhang | 
|  | is set if this item hangs into the left indent. | 
|  | .B IFrjust | 
|  | is set if the line containing this item should be right justified, | 
|  | and | 
|  | .B IFcjust | 
|  | is set for center justified lines. | 
|  | .B IFsmap | 
|  | is used to indicate that an image is a server-side map. | 
|  | The low 8 bits, represented by | 
|  | .BR IFhangmask , | 
|  | indicate the current hang into left indent, in tenths of a tab stop. | 
|  | The next 8 bits, represented by | 
|  | .B IFindentmask | 
|  | and | 
|  | .BR IFindentshift , | 
|  | indicate the current indent in tab stops. | 
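The packing of the state word can be illustrated with a few helper functions. This is a sketch, not part of the library: the helper names are hypothetical, and only the constants needed here are copied from the enumeration above.

```c
/* Constants copied from the enumeration above (subset). */
enum
{
	IFwrap =	0x04000000,
	IFindentshift =	8,
	IFindentmask =	(255<<IFindentshift),
	IFhangmask =	255
};

/* Current indent, in tab stops (bits 8..15 of state). */
static int
stateindent(int state)
{
	return (state & IFindentmask) >> IFindentshift;
}

/* Current hang into the left indent, in tenths of a tab stop (low 8 bits). */
static int
statehang(int state)
{
	return state & IFhangmask;
}

/* Non-zero if the line containing the item is allowed to wrap. */
static int
statewraps(int state)
{
	return (state & IFwrap) != 0;
}
```

A caller would pass the state field of an item, for example stateindent(it->state).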
|  | .PP | 
|  | The field | 
|  | .B genattr | 
|  | is an optional pointer to an auxiliary structure, described in the section | 
|  | .IR "Generic Attributes" . | 
|  | .PP | 
|  | Finally, | 
|  | .B tag | 
|  | describes which variant type this item has. | 
|  | It can have one of the values | 
|  | .BR Itexttag , | 
|  | .BR Iruletag , | 
|  | .BR Iimagetag , | 
|  | .BR Iformfieldtag , | 
|  | .BR Itabletag , | 
|  | .B Ifloattag | 
|  | or | 
|  | .BR Ispacertag . | 
|  | For each of these values, there is an additional structure defined, which | 
|  | includes Item as an unnamed initial substructure, and then defines additional | 
|  | fields. | 
|  | .PP | 
|  | Items of type | 
|  | .B Itexttag | 
|  | represent a piece of text, using the following structure: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Rune* 'u | 
|  | struct Itext | 
|  | { | 
|  | Item; | 
|  | Rune*	s; | 
|  | int	fnt; | 
|  | int	fg; | 
|  | uchar	voff; | 
|  | uchar	ul; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | Here | 
|  | .B s | 
|  | is a null-terminated Unicode string of the actual characters making up this text item, | 
|  | .B fnt | 
|  | is the font number (described in the section | 
|  | .IR "Font Numbers" ), | 
|  | and | 
|  | .B fg | 
|  | is the RGB encoded color for the text. | 
|  | .B Voff | 
|  | measures the vertical offset from the baseline; subtract | 
|  | .B Voffbias | 
|  | to get the actual value (negative values represent a displacement down the page). | 
|  | The field | 
|  | .B ul | 
|  | is the underline style: | 
|  | .B ULnone | 
|  | if no underline, | 
|  | .B ULunder | 
|  | for conventional underline, and | 
|  | .B ULmid | 
|  | for strike-through. | 
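The following sketch shows one way a caller might walk an item list and pick out the text items by their tag. It is written in portable C, so the unnamed Item substructure is replaced by a named first member called item; the structures are trimmed to the fields used, the Rune typedef and the tag values are stand-ins for those in the real headers, and runelen and textrunes are hypothetical helpers.

```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for the system Rune type */

enum { Itexttag, Iruletag };	/* placeholder values; use those from html.h */

typedef struct Item Item;
struct Item
{
	Item*	next;
	int	tag;		/* other generic fields omitted */
};

typedef struct Itext Itext;
struct Itext
{
	Item	item;		/* must be first, so Item* can be cast to Itext* */
	Rune*	s;		/* other text fields omitted */
};

/* Length of a null-terminated Rune string. */
static size_t
runelen(const Rune *s)
{
	size_t n;

	for(n = 0; s[n] != 0; n++)
		;
	return n;
}

/* Total number of characters held by the text items in a list. */
static size_t
textrunes(Item *items)
{
	Item *it;
	size_t n;

	n = 0;
	for(it = items; it != NULL; it = it->next)
		if(it->tag == Itexttag)
			n += runelen(((Itext*)it)->s);
	return n;
}
```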
|  | .PP | 
|  | Items of type | 
|  | .B Iruletag | 
|  | represent a horizontal rule, as follows: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Dimen 'u | 
|  | struct Irule | 
|  | { | 
|  | Item; | 
|  | uchar	align; | 
|  | uchar	noshade; | 
|  | int	size; | 
|  | Dimen	wspec; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | Here | 
|  | .B align | 
|  | is the alignment specification (described in the corresponding section), | 
|  | .B noshade | 
|  | is set if the rule should not be shaded, | 
|  | .B size | 
|  | is the height of the rule (as set by the size attribute), | 
|  | and | 
|  | .B wspec | 
|  | is the desired width (see section | 
|  | .IR "Dimension Specifications" ). | 
|  | .PP | 
|  | Items of type | 
|  | .B Iimagetag | 
|  | describe embedded images, for which the following structure is defined: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Iimage* 'u | 
|  | struct Iimage | 
|  | { | 
|  | Item; | 
|  | Rune*	imsrc; | 
|  | int	imwidth; | 
|  | int	imheight; | 
|  | Rune*	altrep; | 
|  | Map*	map; | 
|  | int	ctlid; | 
|  | uchar	align; | 
|  | uchar	hspace; | 
|  | uchar	vspace; | 
|  | uchar	border; | 
|  | Iimage*	nextimage; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | Here | 
|  | .B imsrc | 
|  | is the URL of the image source, | 
|  | .B imwidth | 
|  | and | 
|  | .BR imheight , | 
|  | if non-zero, contain the specified width and height for the image, | 
|  | and | 
|  | .B altrep | 
|  | is the text to use as an alternative to the image, if the image is not displayed. | 
|  | .BR Map , | 
|  | if set, points to a structure describing an associated client-side image map. | 
|  | .B Ctlid | 
|  | is reserved for use by the application, for handling animated images. | 
|  | .B Align | 
|  | encodes the alignment specification of the image. | 
|  | .B Hspace | 
|  | contains the number of pixels to pad the image with on either side, and | 
|  | .B Vspace | 
|  | the padding above and below. | 
|  | .B Border | 
|  | is the width of the border to draw around the image. | 
|  | .B Nextimage | 
|  | points to the next image in the document (the head of this list is | 
|  | .BR Docinfo.images ). | 
|  | .PP | 
|  | For items of type | 
|  | .BR Iformfieldtag , | 
|  | the following structure is defined: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Formfield* 'u | 
|  | struct Iformfield | 
|  | { | 
|  | Item; | 
|  | Formfield*	formfield; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | This adds a single field, | 
|  | .BR formfield , | 
|  | which points to a structure describing a field in a form, described in section | 
|  | .IR Forms . | 
|  | .PP | 
|  | For items of type | 
|  | .BR Itabletag , | 
|  | the following structure is defined: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Table* 'u | 
|  | struct Itable | 
|  | { | 
|  | Item; | 
|  | Table*	table; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Table | 
|  | points to a structure describing the table, described in the section | 
|  | .IR Tables . | 
|  | .PP | 
|  | For items of type | 
|  | .BR Ifloattag , | 
|  | the following structure is defined: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Ifloat* 'u | 
|  | struct Ifloat | 
|  | { | 
|  | Item; | 
|  | Item*	item; | 
|  | int	x; | 
|  | int	y; | 
|  | uchar	side; | 
|  | uchar	infloats; | 
|  | Ifloat*	nextfloat; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | The | 
|  | .B item | 
|  | points to a single item (either a table or an image) that floats (the text of the | 
|  | document flows around it), and | 
|  | .B side | 
|  | indicates the margin that this float sticks to; it is either | 
|  | .B ALleft | 
|  | or | 
|  | .BR ALright . | 
|  | .B X | 
|  | and | 
|  | .B y | 
|  | are reserved for use by the caller; these are typically used for the coordinates | 
|  | of the top of the float. | 
|  | .B Infloats | 
|  | is used by the caller to keep track of whether it has placed the float. | 
|  | .B Nextfloat | 
|  | is used by the caller to link together all of the floats that it has placed. | 
|  | .PP | 
|  | For items of type | 
|  | .BR Ispacertag , | 
|  | the following structure is defined: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Item; 'u | 
|  | struct Ispacer | 
|  | { | 
|  | Item; | 
|  | int	spkind; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Spkind | 
|  | encodes the kind of spacer, and may be one of | 
|  | .B ISPnull | 
|  | (zero height and width), | 
|  | .B ISPvline | 
|  | (takes on height and ascent of the current font), | 
|  | .B ISPhspace | 
|  | (has the width of a space in the current font) and | 
|  | .B ISPgeneral | 
|  | (for all other purposes, such as between markers and lists). | 
|  | .SS Generic Attributes | 
|  | .PP | 
|  | The genattr field of an item, if non-nil, points to a structure that holds | 
|  | the values of attributes not specific to any particular | 
|  | item type, as they occur on a wide variety of underlying HTML tags. | 
|  | The structure is as follows: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'SEvent* 'u | 
|  | typedef struct Genattr Genattr; | 
|  | struct Genattr | 
|  | { | 
|  | Rune*	id; | 
|  | Rune*	class; | 
|  | Rune*	style; | 
|  | Rune*	title; | 
|  | SEvent*	events; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | Fields | 
|  | .BR id , | 
|  | .BR class , | 
|  | .B style | 
|  | and | 
|  | .BR title , | 
|  | when non-nil, contain values of correspondingly named attributes of the HTML tag | 
|  | associated with this item. | 
|  | .B Events | 
|  | is a linked list of events (with corresponding scripted actions) associated with the item: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'SEvent* 'u | 
|  | typedef struct SEvent SEvent; | 
|  | struct SEvent | 
|  | { | 
|  | SEvent*	next; | 
|  | int	type; | 
|  | Rune*	script; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | Here, | 
|  | .B next | 
|  | points to the next event in the list, | 
|  | .B type | 
|  | is one of | 
|  | .BR SEonblur , | 
|  | .BR SEonchange , | 
|  | .BR SEonclick , | 
|  | .BR SEondblclick , | 
|  | .BR SEonfocus , | 
|  | .BR SEonkeypress , | 
|  | .BR SEonkeyup , | 
|  | .BR SEonload , | 
|  | .BR SEonmousedown , | 
|  | .BR SEonmousemove , | 
|  | .BR SEonmouseout , | 
|  | .BR SEonmouseover , | 
|  | .BR SEonmouseup , | 
|  | .BR SEonreset , | 
|  | .BR SEonselect , | 
|  | .B SEonsubmit | 
|  | or | 
|  | .BR SEonunload , | 
|  | and | 
|  | .B script | 
|  | is the text of the associated script. | 
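As a sketch of how the event list might be consulted, the hypothetical helper below returns the script attached to a given event type, or nil if there is none. The Rune typedef and the numeric values of the event constants are placeholders for this example.

```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for the system Rune type */

enum { SEonblur, SEonchange, SEonclick };	/* placeholder values */

typedef struct SEvent SEvent;
struct SEvent
{
	SEvent*	next;
	int	type;
	Rune*	script;
};

/* Return the script of the first event of the given type, or NULL. */
static Rune*
eventscript(SEvent *events, int type)
{
	SEvent *e;

	for(e = events; e != NULL; e = e->next)
		if(e->type == type)
			return e->script;
	return NULL;
}
```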
|  | .SS Dimension Specifications | 
|  | .PP | 
|  | Some structures include a dimension specification, used where | 
|  | a number can be followed by a | 
|  | .B % | 
|  | or a | 
|  | .B * | 
|  | to indicate | 
|  | percentage of total or relative weight. | 
|  | This is encoded using the following structure: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'int 'u | 
|  | typedef struct Dimen Dimen; | 
|  | struct Dimen | 
|  | { | 
|  | int	kindspec; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | Separate kind and spec values are extracted using | 
|  | .I dimenkind | 
|  | and | 
|  | .IR dimenspec . | 
|  | .I Dimenkind | 
|  | returns one of | 
|  | .BR Dnone , | 
|  | .BR Dpixels , | 
|  | .B Dpercent | 
|  | or | 
|  | .BR Drelative . | 
|  | .B Dnone | 
|  | means that no dimension was specified. | 
|  | In all other cases, | 
|  | .I dimenspec | 
|  | should be called to find the absolute number of pixels, the percentage of total, | 
|  | or the relative weight. | 
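Once the kind and spec have been extracted, a layout routine still has to turn them into pixels. The sketch below shows one possible policy, under the assumption that the caller knows the available width; resolvewidth is a hypothetical helper, the enumeration values are placeholders, and returning -1 for unresolved cases is a choice made for this example.

```c
enum { Dnone, Dpixels, Dpercent, Drelative };	/* placeholder values */

/*
 * Convert a (kind, spec) pair, as returned by dimenkind and dimenspec,
 * into a pixel width given avail pixels of available space.
 * Returns -1 when the width cannot be decided here: Dnone gives no
 * request, and Drelative weights must be shared out among siblings.
 */
static int
resolvewidth(int kind, int spec, int avail)
{
	switch(kind){
	case Dpixels:
		return spec;
	case Dpercent:
		return avail*spec/100;
	default:
		return -1;
	}
}
```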
|  | .SS Background Specifications | 
|  | .PP | 
|  | It is possible to set the background of the entire document, and also | 
|  | for some parts of the document (such as tables). | 
|  | This is encoded as follows: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Rune* 'u | 
|  | typedef struct Background Background; | 
|  | struct Background | 
|  | { | 
|  | Rune*	image; | 
|  | int	color; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .BR Image , | 
|  | if non-nil, is the URL of an image to use as the background. | 
|  | If this is nil, | 
|  | .B color | 
|  | is used instead, as the RGB value for a solid fill color. | 
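A drawing routine usually needs the separate color components. The sketch below assumes the value is packed as 0xRRGGBB, with red in the most significant of the three bytes; that byte order is an assumption of this example and should be checked against the color constants in html.h. The RGB type and rgbsplit are hypothetical.

```c
typedef struct RGB RGB;
struct RGB
{
	unsigned char	r;
	unsigned char	g;
	unsigned char	b;
};

/* Split a 24-bit color, assumed to be packed as 0xRRGGBB. */
static RGB
rgbsplit(int color)
{
	RGB c;

	c.r = (color >> 16) & 0xFF;
	c.g = (color >> 8) & 0xFF;
	c.b = color & 0xFF;
	return c;
}
```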
|  | .SS Alignment Specifications | 
|  | .PP | 
|  | Certain items have alignment specifiers taken from the following | 
|  | enumerated type: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n | 
|  | enum | 
|  | { | 
|  | ALnone = 0, ALleft, ALcenter, ALright, ALjustify, | 
|  | ALchar, ALtop, ALmiddle, ALbottom, ALbaseline | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | These values correspond to the various alignment types named in the HTML 4.0 | 
|  | standard. | 
|  | If an item has an alignment of | 
|  | .B ALleft | 
|  | or | 
|  | .BR ALright , | 
|  | the library automatically encapsulates it inside a float item. | 
|  | .PP | 
|  | Tables, and the various rows, columns and cells within them, have a more | 
|  | complex alignment specification, composed of separate vertical and | 
|  | horizontal alignments: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'uchar 'u | 
|  | typedef struct Align Align; | 
|  | struct Align | 
|  | { | 
|  | uchar	halign; | 
|  | uchar	valign; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Halign | 
|  | can be one of | 
|  | .BR ALnone , | 
|  | .BR ALleft , | 
|  | .BR ALcenter , | 
|  | .BR ALright , | 
|  | .B ALjustify | 
|  | or | 
|  | .BR ALchar . | 
|  | .B Valign | 
|  | can be one of | 
|  | .BR ALnone , | 
|  | .BR ALmiddle , | 
|  | .BR ALbottom , | 
|  | .B ALtop | 
|  | or | 
|  | .BR ALbaseline . | 
|  | .SS Font Numbers | 
|  | .PP | 
|  | Text items have an associated font number (the | 
|  | .B fnt | 
|  | field), which is encoded as | 
|  | .BR style*NumSize+size . | 
|  | Here, | 
|  | .B style | 
|  | is one of | 
|  | .BR FntR , | 
|  | .BR FntI , | 
|  | .B FntB | 
|  | or | 
|  | .BR FntT , | 
|  | for roman, italic, bold and typewriter font styles, respectively, and size is | 
|  | .BR Tiny , | 
|  | .BR Small , | 
|  | .BR Normal , | 
|  | .B Large | 
|  | or | 
|  | .BR Verylarge . | 
|  | The total number of possible font numbers is | 
|  | .BR NumFnt , | 
|  | and the default font number is | 
|  | .B DefFnt | 
|  | (which is roman style, normal size). | 
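The encoding can be sketched with three small helpers. The enumerations below assume that the styles and sizes are numbered from zero in the order listed above, and NumStyle is a name introduced for this example; the helper functions are hypothetical.

```c
enum { FntR, FntI, FntB, FntT, NumStyle };		/* assumed order */
enum { Tiny, Small, Normal, Large, Verylarge, NumSize };	/* assumed order */
enum
{
	NumFnt = NumStyle*NumSize,
	DefFnt = FntR*NumSize + Normal
};

/* Combine a style and a size into a font number. */
static int
fntnum(int style, int size)
{
	return style*NumSize + size;
}

/* Recover the style from a font number. */
static int
fntstyle(int fnt)
{
	return fnt / NumSize;
}

/* Recover the size from a font number. */
static int
fntsize(int fnt)
{
	return fnt % NumSize;
}
```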
|  | .SS Document Info | 
|  | .PP | 
|  | Global information about an HTML page is stored in the following structure: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'DestAnchor* 'u | 
|  | typedef struct Docinfo Docinfo; | 
|  | struct Docinfo | 
|  | { | 
|  | // stuff from HTTP headers, doc head, and body tag | 
|  | Rune*	src; | 
|  | Rune*	base; | 
|  | Rune*	doctitle; | 
|  | Background	background; | 
|  | Iimage*	backgrounditem; | 
|  | int	text; | 
|  | int	link; | 
|  | int	vlink; | 
|  | int	alink; | 
|  | int	target; | 
|  | int	chset; | 
|  | int	mediatype; | 
|  | int	scripttype; | 
|  | int	hasscripts; | 
|  | Rune*	refresh; | 
|  | Kidinfo*	kidinfo; | 
|  | int	frameid; | 
|  |  | 
|  | // info needed to respond to user actions | 
|  | Anchor*	anchors; | 
|  | DestAnchor*	dests; | 
|  | Form*	forms; | 
|  | Table*	tables; | 
|  | Map*	maps; | 
|  | Iimage*	images; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Src | 
|  | gives the URL of the original source of the document, | 
|  | and | 
|  | .B base | 
|  | is the base URL. | 
|  | .B Doctitle | 
|  | is the document's title, as set by a | 
|  | .B <title> | 
|  | element. | 
|  | .B Background | 
|  | is as described in the section | 
|  | .IR "Background Specifications" , | 
|  | and | 
|  | .B backgrounditem | 
|  | is set to be an image item for the document's background image (if given as a URL), | 
|  | or else nil. | 
|  | .B Text | 
|  | gives the default foreground text color of the document, | 
|  | .B link | 
|  | the unvisited hyperlink color, | 
|  | .B vlink | 
|  | the visited hyperlink color, and | 
|  | .B alink | 
|  | the color for highlighting hyperlinks (all in 24-bit RGB format). | 
|  | .B Target | 
|  | is the default target frame id. | 
|  | .B Chset | 
|  | and | 
|  | .B mediatype | 
|  | are as for the | 
|  | .I chset | 
|  | and | 
|  | .I mtype | 
|  | parameters to | 
|  | .IR parsehtml . | 
|  | .B Scripttype | 
|  | is the type of any scripts contained in the document, and is always | 
|  | .BR TextJavascript . | 
|  | .B Hasscripts | 
|  | is set if the document contains any scripts. | 
|  | Scripting is currently unsupported. | 
|  | .B Refresh | 
|  | is the contents of a | 
|  | .B "<meta http-equiv=Refresh ...>" | 
|  | tag, if any. | 
|  | .B Kidinfo | 
|  | is set if this document is a frameset (see section | 
|  | .IR Frames ). | 
|  | .B Frameid | 
|  | is this document's frame id. | 
|  | .PP | 
|  | .B Anchors | 
|  | is a list of hyperlinks contained in the document, | 
|  | and | 
|  | .B dests | 
|  | is a list of hyperlink destinations within the page (see the following section for details). | 
|  | .BR Forms , | 
|  | .B tables | 
|  | and | 
|  | .B maps | 
|  | are lists of the various forms, tables and client-side maps contained | 
|  | in the document, as described in subsequent sections. | 
|  | .B Images | 
|  | is a list of all the image items in the document. | 
|  | .SS Anchors | 
|  | .PP | 
|  | The library builds two lists for all of the | 
|  | .B <a> | 
|  | elements (anchors) in a document. | 
|  | Each anchor is assigned a unique anchor id within the document. | 
|  | For anchors which are hyperlinks (the | 
|  | .B href | 
|  | attribute was supplied), the following structure is defined: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Anchor* 'u | 
|  | typedef struct Anchor Anchor; | 
|  | struct Anchor | 
|  | { | 
|  | Anchor*	next; | 
|  | int	index; | 
|  | Rune*	name; | 
|  | Rune*	href; | 
|  | int	target; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Next | 
|  | points to the next anchor in the list (the head of this list is | 
|  | .BR Docinfo.anchors ). | 
|  | .B Index | 
|  | is the anchor id; each item within this hyperlink is tagged with this value | 
|  | in its | 
|  | .B anchorid | 
|  | field. | 
|  | .B Name | 
|  | and | 
|  | .B href | 
|  | are the values of the correspondingly named attributes of the anchor | 
|  | (in particular, href is the URL to go to). | 
|  | .B Target | 
|  | is the value of the target attribute (if provided) converted to a frame id. | 
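When the user selects an item, the caller typically needs the hyperlink that the item belongs to. A minimal sketch, assuming the list head comes from Docinfo.anchors and the id from the item's anchorid field; findanchor is a hypothetical helper and the Rune typedef is a stand-in.

```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for the system Rune type */

typedef struct Anchor Anchor;
struct Anchor
{
	Anchor*	next;
	int	index;
	Rune*	name;
	Rune*	href;
	int	target;
};

/*
 * Find the hyperlink with the given anchor id.
 * An id of zero means the item is not inside an anchor.
 */
static Anchor*
findanchor(Anchor *anchors, int anchorid)
{
	Anchor *a;

	if(anchorid == 0)
		return NULL;
	for(a = anchors; a != NULL; a = a->next)
		if(a->index == anchorid)
			return a;
	return NULL;
}
```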
|  | .PP | 
|  | Destinations within the document (anchors with the name attribute set) | 
|  | are held in the | 
|  | .B Docinfo.dests | 
|  | list, using the following structure: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'DestAnchor* 'u | 
|  | typedef struct DestAnchor DestAnchor; | 
|  | struct DestAnchor | 
|  | { | 
|  | DestAnchor*	next; | 
|  | int	index; | 
|  | Rune*	name; | 
|  | Item*	item; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Next | 
|  | is the next element of the list, | 
|  | .B index | 
|  | is the anchor id, | 
|  | .B name | 
|  | is the value of the name attribute, and | 
|  | .B item | 
|  | points to the item within the parsed document that should be considered | 
|  | to be the destination. | 
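Following a link to a fragment of the same page amounts to looking up the fragment name in the destination list. The sketch below assumes the caller has already converted the fragment to a Rune string; runeeq and finddest are hypothetical helpers, the Rune typedef is a stand-in, and Item is left opaque.

```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for the system Rune type */
typedef struct Item Item;	/* opaque here */

typedef struct DestAnchor DestAnchor;
struct DestAnchor
{
	DestAnchor*	next;
	int	index;
	Rune*	name;
	Item*	item;
};

/* Non-zero if two null-terminated Rune strings are identical. */
static int
runeeq(const Rune *a, const Rune *b)
{
	while(*a != 0 && *a == *b){
		a++;
		b++;
	}
	return *a == *b;
}

/* Find the destination called name in a Docinfo.dests list, or NULL. */
static DestAnchor*
finddest(DestAnchor *dests, const Rune *name)
{
	DestAnchor *d;

	for(d = dests; d != NULL; d = d->next)
		if(d->name != NULL && runeeq(d->name, name))
			return d;
	return NULL;
}
```

The item field of the result is the place in the item list to scroll to.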
|  | .SS Forms | 
|  | .PP | 
|  | Any forms within a document are kept in a list, headed by | 
|  | .BR Docinfo.forms . | 
|  | The elements of this list are as follows: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Formfield* 'u | 
|  | typedef struct Form Form; | 
|  | struct Form | 
|  | { | 
|  | Form*	next; | 
|  | int	formid; | 
|  | Rune*	name; | 
|  | Rune*	action; | 
|  | int	target; | 
|  | int	method; | 
|  | int	nfields; | 
|  | Formfield*	fields; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Next | 
|  | points to the next form in the list. | 
|  | .B Formid | 
|  | is a serial number for the form within the document. | 
|  | .B Name | 
|  | is the value of the form's name or id attribute. | 
|  | .B Action | 
|  | is the value of any action attribute. | 
|  | .B Target | 
|  | is the value of the target attribute (if any) converted to a frame target id. | 
|  | .B Method | 
|  | is one of | 
|  | .B HGet | 
|  | or | 
|  | .BR HPost . | 
|  | .B Nfields | 
|  | is the number of fields in the form, and | 
|  | .B fields | 
|  | is a linked list of the actual fields. | 
|  | .PP | 
|  | The individual fields in a form are described by the following structure: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Formfield* 'u | 
|  | typedef struct Formfield Formfield; | 
|  | struct Formfield | 
|  | { | 
|  | Formfield*	next; | 
|  | int	ftype; | 
|  | int	fieldid; | 
|  | Form*	form; | 
|  | Rune*	name; | 
|  | Rune*	value; | 
|  | int	size; | 
|  | int	maxlength; | 
|  | int	rows; | 
|  | int	cols; | 
|  | uchar	flags; | 
|  | Option*	options; | 
|  | Item*	image; | 
|  | int	ctlid; | 
|  | SEvent*	events; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | Here, | 
|  | .B next | 
|  | points to the next field in the list. | 
|  | .B Ftype | 
|  | is the type of the field, which can be one of | 
|  | .BR Ftext , | 
|  | .BR Fpassword , | 
|  | .BR Fcheckbox , | 
|  | .BR Fradio , | 
|  | .BR Fsubmit , | 
|  | .BR Fhidden , | 
|  | .BR Fimage , | 
|  | .BR Freset , | 
|  | .BR Ffile , | 
|  | .BR Fbutton , | 
|  | .B Fselect | 
|  | or | 
|  | .BR Ftextarea . | 
|  | .B Fieldid | 
|  | is a serial number for the field within the form. | 
|  | .B Form | 
|  | points back to the form containing this field. | 
|  | .BR Name , | 
|  | .BR value , | 
|  | .BR size , | 
|  | .BR maxlength , | 
|  | .B rows | 
|  | and | 
|  | .B cols | 
|  | each contain the values of corresponding attributes of the field, if present. | 
|  | .B Flags | 
|  | contains per-field flags, of which | 
|  | .B FFchecked | 
|  | and | 
|  | .B FFmultiple | 
|  | are defined. | 
|  | .B Image | 
|  | is only used for fields of type | 
|  | .BR Fimage ; | 
|  | it points to an image item containing the image to be displayed. | 
|  | .B Ctlid | 
|  | is reserved for use by the caller, typically to store a unique id | 
|  | of an associated control used to implement the field. | 
|  | .B Events | 
|  | is the same as the corresponding field of the generic attributes | 
|  | associated with the item containing this field. | 
|  | .B Options | 
|  | is only used by fields of type | 
|  | .BR Fselect ; | 
|  | it consists of a list of possible options that may be selected for that | 
|  | field, using the following structure: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Option* 'u | 
|  | typedef struct Option Option; | 
|  | struct Option | 
|  | { | 
|  | Option*	next; | 
|  | int	selected; | 
|  | Rune*	value; | 
|  | Rune*	display; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Next | 
|  | points to the next element of the list. | 
|  | .B Selected | 
|  | is set if this option is to be displayed initially. | 
|  | .B Value | 
|  | is the value to send when the form is submitted if this option is selected. | 
|  | .B Display | 
|  | is the string to display on the screen for this option. | 
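As a small illustration of the option list, the hypothetical helper below returns the first option marked as selected, which is what a caller might show when a select field is first drawn. The Rune typedef is a stand-in, and falling back to nil when nothing is selected is a choice made for this sketch.

```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for the system Rune type */

typedef struct Option Option;
struct Option
{
	Option*	next;
	int	selected;
	Rune*	value;
	Rune*	display;
};

/* Return the first option with selected set, or NULL if none is. */
static Option*
firstselected(Option *options)
{
	Option *o;

	for(o = options; o != NULL; o = o->next)
		if(o->selected)
			return o;
	return NULL;
}
```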
|  | .SS Tables | 
|  | .PP | 
|  | The library builds a list of all the tables in the document, | 
|  | headed by | 
|  | .BR Docinfo.tables . | 
|  | Each element of this list has the following format: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Tablecell*** 'u | 
|  | typedef struct Table Table; | 
|  | struct Table | 
|  | { | 
|  | Table*	next; | 
|  | int	tableid; | 
|  | Tablerow*	rows; | 
|  | int	nrow; | 
|  | Tablecol*	cols; | 
|  | int	ncol; | 
|  | Tablecell*	cells; | 
|  | int	ncell; | 
|  | Tablecell***	grid; | 
|  | Align	align; | 
|  | Dimen	width; | 
|  | int	border; | 
|  | int	cellspacing; | 
|  | int	cellpadding; | 
|  | Background	background; | 
|  | Item*	caption; | 
|  | uchar	caption_place; | 
|  | Lay*	caption_lay; | 
|  | int	totw; | 
|  | int	toth; | 
|  | int	caph; | 
|  | int	availw; | 
|  | Token*	tabletok; | 
|  | uchar	flags; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Next | 
|  | points to the next element in the list of tables. | 
|  | .B Tableid | 
|  | is a serial number for the table within the document. | 
|  | .B Rows | 
|  | is an array of row specifications (described below) and | 
|  | .B nrow | 
|  | is the number of elements in this array. | 
|  | Similarly, | 
|  | .B cols | 
|  | is an array of column specifications, and | 
|  | .B ncol | 
|  | the size of this array. | 
|  | .B Cells | 
|  | is a list of all cells within the table (structure described below) | 
|  | and | 
|  | .B ncell | 
|  | is the number of elements in this list. | 
|  | Note that a cell may span multiple rows and/or columns, thus | 
|  | .B ncell | 
|  | may be smaller than | 
|  | .BR nrow*ncol . | 
|  | .B Grid | 
|  | is a two-dimensional array of cells within the table; the cell | 
|  | at row | 
|  | .B i | 
|  | and column | 
|  | .B j | 
|  | is | 
|  | .BR Table.grid[i][j] . | 
|  | A cell that spans multiple rows and/or columns will | 
|  | be referenced by | 
|  | .B grid | 
|  | multiple times, but it occurs only once in | 
|  | .BR cells . | 
|  | .B Align | 
|  | gives the alignment specification for the entire table, | 
|  | and | 
|  | .B width | 
|  | gives the requested width as a dimension specification. | 
|  | .BR Border , | 
|  | .B cellspacing | 
|  | and | 
|  | .B cellpadding | 
|  | give the values of the corresponding attributes for the table, | 
|  | and | 
|  | .B background | 
|  | gives the requested background for the table. | 
|  | .B Caption | 
|  | is a linked list of items to be displayed as the caption of the | 
|  | table, either above or below depending on whether | 
|  | .B caption_place | 
|  | is | 
|  | .B ALtop | 
|  | or | 
|  | .BR ALbottom . | 
|  | Most of the remaining fields are reserved for use by the caller, | 
|  | except | 
|  | .BR tabletok , | 
|  | which is reserved for internal use. | 
|  | The type | 
|  | .B Lay | 
|  | is not defined by the library; the caller can provide its | 
|  | own definition. | 
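The relationship between grid and cells can be illustrated by counting how many grid slots refer to one cell. This is a sketch with the structures trimmed to the fields involved; gridrefs is a hypothetical helper.

```c
#include <stddef.h>

typedef struct Tablecell Tablecell;
struct Tablecell
{
	Tablecell*	next;
	int	rowspan;
	int	colspan;	/* other fields omitted */
};

typedef struct Table Table;
struct Table
{
	int	nrow;
	int	ncol;
	Tablecell*	cells;
	int	ncell;
	Tablecell***	grid;	/* other fields omitted */
};

/* Count the grid slots that refer to cell c. */
static int
gridrefs(const Table *t, const Tablecell *c)
{
	int i, j, n;

	n = 0;
	for(i = 0; i < t->nrow; i++)
		for(j = 0; j < t->ncol; j++)
			if(t->grid[i][j] == c)
				n++;
	return n;
}
```

For a cell spanning two columns of one row the count is 2, even though the cell appears once in the cells list.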
|  | .PP | 
|  | The | 
|  | .B Tablecol | 
|  | structure is defined for use by the caller. | 
|  | The library ensures that the correct number of these | 
|  | is allocated, but leaves them blank. | 
|  | The fields are as follows: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Point 'u | 
|  | typedef struct Tablecol Tablecol; | 
|  | struct Tablecol | 
|  | { | 
|  | int	width; | 
|  | Align	align; | 
|  | Point	pos; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | The rows in the table are specified as follows: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Background 'u | 
|  | typedef struct Tablerow Tablerow; | 
|  | struct Tablerow | 
|  | { | 
|  | Tablerow*	next; | 
|  | Tablecell*	cells; | 
|  | int	height; | 
|  | int	ascent; | 
|  | Align	align; | 
|  | Background	background; | 
|  | Point	pos; | 
|  | uchar	flags; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Next | 
|  | is only used during parsing; it should be ignored by the caller. | 
|  | .B Cells | 
|  | provides a list of all the cells in a row, linked through their | 
|  | .B nextinrow | 
|  | fields (see below). | 
|  | .BR Height , | 
|  | .B ascent | 
|  | and | 
|  | .B pos | 
|  | are reserved for use by the caller. | 
|  | .B Align | 
|  | is the alignment specification for the row, and | 
|  | .B background | 
|  | is the background to use, if specified. | 
|  | .B Flags | 
|  | is used by the parser; ignore this field. | 
|  | .PP | 
|  | The individual cells of the table are described as follows: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Background 'u | 
|  | typedef struct Tablecell Tablecell; | 
|  | struct Tablecell | 
|  | { | 
|  | Tablecell*	next; | 
|  | Tablecell*	nextinrow; | 
|  | int	cellid; | 
|  | Item*	content; | 
|  | Lay*	lay; | 
|  | int	rowspan; | 
|  | int	colspan; | 
|  | Align	align; | 
|  | uchar	flags; | 
|  | Dimen	wspec; | 
|  | int	hspec; | 
|  | Background	background; | 
|  | int	minw; | 
|  | int	maxw; | 
|  | int	ascent; | 
|  | int	row; | 
|  | int	col; | 
|  | Point	pos; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Next | 
|  | is used to link together the list of all cells within a table | 
|  | .RB ( Table.cells ), | 
|  | whereas | 
|  | .B nextinrow | 
|  | is used to link together all the cells within a single row | 
|  | .RB ( Tablerow.cells ). | 
|  | .B Cellid | 
|  | provides a serial number for the cell within the table. | 
|  | .B Content | 
|  | is a linked list of the items to be laid out within the cell. | 
|  | .B Lay | 
|  | is reserved for the user to describe how these items have | 
|  | been laid out. | 
|  | .B Rowspan | 
|  | and | 
|  | .B colspan | 
|  | are the number of rows and columns spanned by this cell, | 
|  | respectively. | 
|  | .B Align | 
|  | is the alignment specification for the cell. | 
|  | .B Flags | 
|  | is some combination of | 
|  | .BR TFparsing , | 
|  | .B TFnowrap | 
|  | and | 
|  | .B TFisth | 
|  | or'd together. | 
|  | Here | 
|  | .B TFparsing | 
|  | is used internally by the parser, and should be ignored. | 
|  | .B TFnowrap | 
|  | means that the contents of the cell should not be | 
|  | wrapped if they don't fit the available width; | 
|  | rather, the table should be expanded if need be | 
|  | (this is set when the nowrap attribute is supplied). | 
|  | .B TFisth | 
|  | means that the cell was created by the | 
|  | .B <th> | 
|  | element (rather than the | 
|  | .B <td> | 
|  | element), | 
|  | indicating that it is a header cell rather than a data cell. | 
|  | .B Wspec | 
|  | provides a suggested width as a dimension specification, | 
|  | and | 
|  | .B hspec | 
|  | provides a suggested height in pixels. | 
|  | .B Background | 
|  | gives a background specification for the individual cell. | 
|  | .BR Minw , | 
|  | .BR maxw , | 
|  | .B ascent | 
|  | and | 
|  | .B pos | 
|  | are reserved for use by the caller during layout. | 
|  | .B Row | 
|  | and | 
|  | .B col | 
|  | give the indices of the row and column of the top left-hand | 
|  | corner of the cell within the table grid. | 
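|  | .PP | 
|  | As an illustration of how | 
|  | .BR row , | 
|  | .BR col , | 
|  | .B rowspan | 
|  | and | 
|  | .B colspan | 
|  | relate a cell to the table grid, the following sketch counts the | 
|  | grid slots that refer to a given cell and walks a row's | 
|  | .B nextinrow | 
|  | list. | 
|  | It is a minimal, self-contained example in standard C, not library code: | 
|  | the cut-down | 
|  | .B Tablecell | 
|  | and the helpers | 
|  | .B gridrefs | 
|  | and | 
|  | .B rowcells | 
|  | are hypothetical, and a real client would include | 
|  | .B <html.h> | 
|  | and test against nil rather than NULL. | 

```c
#include <stddef.h>

/* Cut-down stand-in for the library's Tablecell; only the
 * fields used below are kept. Real code would use <html.h>. */
typedef struct Tablecell Tablecell;
struct Tablecell
{
	Tablecell*	nextinrow;
	int	rowspan;
	int	colspan;
	int	row;
	int	col;
};

/* Hypothetical helper: count the slots of an nrow x ncol grid
 * that refer to c, looking only at the region c claims to span. */
int
gridrefs(Tablecell ***grid, int nrow, int ncol, Tablecell *c)
{
	int i, j, n;

	n = 0;
	for(i = c->row; i < c->row + c->rowspan && i < nrow; i++)
		for(j = c->col; j < c->col + c->colspan && j < ncol; j++)
			if(grid[i][j] == c)
				n++;
	return n;
}

/* Hypothetical helper: count the cells on one row's nextinrow list. */
int
rowcells(Tablecell *cells)
{
	int n;

	n = 0;
	for(; cells != NULL; cells = cells->nextinrow)
		n++;
	return n;
}
```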
|  | .SS Client-side Maps | 
|  | .PP | 
|  | The library builds a list of client-side maps, headed by | 
|  | .BR Docinfo.maps , | 
|  | and having the following structure: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Rune* 'u | 
|  | typedef struct Map Map; | 
|  | struct Map | 
|  | { | 
|  | Map*	next; | 
|  | Rune*	name; | 
|  | Area*	areas; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Next | 
|  | points to the next element in the list, | 
|  | .B name | 
|  | is the name of the map (used to bind it to an image), and | 
|  | .B areas | 
|  | is a list of the areas within the image that comprise the map, | 
|  | using the following structure: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Dimen* 'u | 
|  | typedef struct Area Area; | 
|  | struct Area | 
|  | { | 
|  | Area*	next; | 
|  | int	shape; | 
|  | Rune*	href; | 
|  | int	target; | 
|  | Dimen*	coords; | 
|  | int	ncoords; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Next | 
|  | points to the next element in the map's list of areas. | 
|  | .B Shape | 
|  | describes the shape of the area, and is one of | 
|  | .BR SHrect , | 
|  | .B SHcircle | 
|  | or | 
|  | .BR SHpoly . | 
|  | .B Href | 
|  | is the URL associated with this area in its role as | 
|  | a hypertext link, and | 
|  | .B target | 
|  | is the target frame it should be loaded in. | 
|  | .B Coords | 
|  | is an array of coordinates for the shape, and | 
|  | .B ncoords | 
|  | is the size of this array (number of elements). | 
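|  | .PP | 
|  | A caller may want to check that | 
|  | .B ncoords | 
|  | is plausible for the given | 
|  | .B shape | 
|  | before interpreting | 
|  | .BR coords . | 
|  | The sketch below assumes the coordinate counts of the HTML 4.01 | 
|  | coords attribute (four values for a rectangle, three for a circle, | 
|  | and x,y pairs for a polygon, taken here to need at least three vertices). | 
|  | The helper | 
|  | .B coordsok | 
|  | is hypothetical, and the enumeration only stands in for the constants | 
|  | that a real client would get from | 
|  | .BR <html.h> ; | 
|  | the library's actual values may differ. | 

```c
/* Stand-ins for the library's shape constants; the values
 * defined in <html.h> may differ. */
enum { SHrect, SHcircle, SHpoly };

/* Hypothetical helper: return 1 if ncoords is a plausible
 * number of coordinates for shape, 0 otherwise.
 * rect: left, top, right, bottom; circle: x, y, radius;
 * poly: x,y pairs, assumed here to need at least three. */
int
coordsok(int shape, int ncoords)
{
	switch(shape){
	case SHrect:
		return ncoords == 4;
	case SHcircle:
		return ncoords == 3;
	case SHpoly:
		return ncoords >= 6 && ncoords%2 == 0;
	}
	return 0;
}
```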
|  | .SS Frames | 
|  | .PP | 
|  | If the | 
|  | .B Docinfo.kidinfo | 
|  | field is set, the document is a frameset. | 
|  | In this case, it is typical for | 
|  | .I parsehtml | 
|  | to return nil, as a document which is a frameset should have no actual | 
|  | items that need to be laid out (such will appear only in subsidiary documents). | 
|  | It is possible that items will be returned for a malformed document; the caller | 
|  | should check for this and free any such items. | 
|  | .PP | 
|  | The | 
|  | .B Kidinfo | 
|  | structure itself reflects the fact that framesets can be nested within a document. | 
|  | It is defined as follows: | 
|  | .PP | 
|  | .EX | 
|  | .ta 6n +\w'Kidinfo* 'u | 
|  | typedef struct Kidinfo Kidinfo; | 
|  | struct Kidinfo | 
|  | { | 
|  | Kidinfo*	next; | 
|  | int	isframeset; | 
|  |  | 
|  | // fields for "frame" | 
|  | Rune*	src; | 
|  | Rune*	name; | 
|  | int	marginw; | 
|  | int	marginh; | 
|  | int	framebd; | 
|  | int	flags; | 
|  |  | 
|  | // fields for "frameset" | 
|  | Dimen*	rows; | 
|  | int	nrows; | 
|  | Dimen*	cols; | 
|  | int	ncols; | 
|  | Kidinfo*	kidinfos; | 
|  | Kidinfo*	nextframeset; | 
|  | }; | 
|  | .EE | 
|  | .PP | 
|  | .B Next | 
|  | is only used if this structure is part of a containing frameset; it points to the next | 
|  | element in the list of children of that frameset. | 
|  | .B Isframeset | 
|  | is set when this structure represents a frameset; if clear, it is an individual frame. | 
|  | .PP | 
|  | Some fields are used only for framesets. | 
|  | .B Rows | 
|  | is an array of dimension specifications for rows in the frameset, and | 
|  | .B nrows | 
|  | is the length of this array. | 
|  | .B Cols | 
|  | is the corresponding array for columns, of length | 
|  | .BR ncols . | 
|  | .B Kidinfos | 
|  | points to a list of components contained within this frameset, each | 
|  | of which may be a frameset or a frame. | 
|  | .B Nextframeset | 
|  | is only used during parsing, and should be ignored. | 
|  | .PP | 
|  | The remaining fields are used if the structure describes a frame, not a frameset. | 
|  | .B Src | 
|  | provides the URL for the document that should be initially loaded into this frame. | 
|  | Note that this may be a relative URL, in which case it should be interpreted | 
|  | using the containing document's URL as the base. | 
|  | .B Name | 
|  | gives the name of the frame, typically supplied via a name attribute in the HTML. | 
|  | If no name was given, the library allocates one. | 
|  | .BR Marginw , | 
|  | .B marginh | 
|  | and | 
|  | .B framebd | 
|  | are the values of the marginwidth, marginheight and frameborder attributes, respectively. | 
|  | .B Flags | 
|  | can contain some combination of the following: | 
|  | .B FRnoresize | 
|  | (the frame had the noresize attribute set, and the user should not be allowed to resize it), | 
|  | .B FRnoscroll | 
|  | (the frame should not have any scroll bars), | 
|  | .B FRhscroll | 
|  | (the frame should have a horizontal scroll bar), | 
|  | .B FRvscroll | 
|  | (the frame should have a vertical scroll bar), | 
|  | .B FRhscrollauto | 
|  | (the frame should be automatically given a horizontal scroll bar if its contents | 
|  | would not otherwise fit), and | 
|  | .B FRvscrollauto | 
|  | (the frame should be given a vertical scroll bar only if required). | 
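|  | .PP | 
|  | Because framesets nest, a caller typically walks a | 
|  | .B Kidinfo | 
|  | tree recursively, following | 
|  | .B next | 
|  | across siblings and | 
|  | .B kidinfos | 
|  | into each frameset. | 
|  | The sketch below counts the individual frames reachable from a node. | 
|  | It is a self-contained example in standard C under stated assumptions: | 
|  | the cut-down | 
|  | .B Kidinfo | 
|  | keeps only the three fields it needs, the helper | 
|  | .B countframes | 
|  | is hypothetical, and a real client would include | 
|  | .B <html.h> | 
|  | and test against nil rather than NULL. | 

```c
#include <stddef.h>

/* Cut-down stand-in for the library's Kidinfo; only the
 * fields needed for traversal are kept. */
typedef struct Kidinfo Kidinfo;
struct Kidinfo
{
	Kidinfo*	next;
	int	isframeset;
	Kidinfo*	kidinfos;
};

/* Hypothetical helper: count the individual frames in the
 * sibling list starting at k, descending into nested framesets. */
int
countframes(Kidinfo *k)
{
	int n;

	n = 0;
	for(; k != NULL; k = k->next){
		if(k->isframeset)
			n += countframes(k->kidinfos);
		else
			n++;
	}
	return n;
}
```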
|  | .SH SOURCE | 
|  | .B \*9/src/libhtml | 
|  | .SH SEE ALSO | 
|  | .IR fmt (1) | 
|  | .PP | 
|  | W3C World Wide Web Consortium, | 
|  | ``HTML 4.01 Specification''. | 
|  | .SH BUGS | 
|  | The entire HTML document must be loaded into memory before | 
|  | any of it can be parsed. |