Home » SQL & PL/SQL » SQL & PL/SQL » Finding initial, final, in_between, and initial_and_final events
Finding initial, final, in_between, and initial_and_final events [message #687913] Thu, 20 July 2023 10:24 Go to next message
manubatham20
Messages: 566
Registered: September 2010
Location: Seattle, WA, USA
Senior Member


WITH RawData AS (
        SELECT 1 AS Sequence_ID, 'program6' AS Program_Name
        UNION ALL
        SELECT 2, 'program2'
        UNION ALL
        SELECT 3, 'program7'
        UNION ALL
        SELECT 4, 'program3'
        UNION ALL
        SELECT 5, 'program1'
               -- Rows with no Program_Name
        UNION ALL
        SELECT 6, NULL
        UNION ALL
        SELECT 7, NULL
        UNION ALL
        SELECT 8, NULL
               -- Continue with program names
        UNION ALL
        SELECT 9, 'program4'
        UNION ALL
        SELECT 10, 'program1'
        UNION ALL
        SELECT 11, 'program2'
               -- Rows with no Program_Name
        UNION ALL
        SELECT 12, NULL
        UNION ALL
        SELECT 13, NULL
               -- Continue with program names
        UNION ALL
        SELECT 14, 'program6'
        UNION ALL
        SELECT 15, 'program1'
               -- Rows with no Program_Name
        UNION ALL
        SELECT 16, NULL
        UNION ALL
        -- Continue with program names
        SELECT 17, 'program5'
        UNION ALL
        SELECT 18, 'program2'
               -- Rows with no Program_Name
        UNION ALL
        SELECT 19, NULL
               -- Continue with program names
        UNION ALL
        SELECT 20, 'program3'
        UNION ALL
        SELECT 21, 'program4'
        UNION ALL
        SELECT 22, 'program2'
        UNION ALL
        SELECT 23, 'program5'
               -- Rows with no Program_Name
        UNION ALL
        SELECT 24, NULL
        UNION ALL
        SELECT 25, NULL
               -- Continue with program names
        UNION ALL
        SELECT 26, 'program1'
        UNION ALL
        SELECT 27, 'program7'
               -- Rows with no Program_Name
        UNION ALL
        SELECT 28, NULL
               -- Continue with program names
        UNION ALL
        SELECT 29, 'program6'
        UNION ALL
        SELECT 30, 'program4'
        UNION ALL
        SELECT 31, 'program2'
               -- Rows with no Program_Name
        UNION ALL
        SELECT 32, NULL
               -- Continue with program names
        UNION ALL
        SELECT 33, 'program5'
        UNION ALL
        SELECT 34, 'program3'
               -- Rows with no Program_Name
        UNION ALL
        SELECT 35, NULL
        UNION ALL
        SELECT 36, NULL
               -- Continue with program names
        UNION ALL
        SELECT 37, 'program2'
        UNION ALL
        SELECT 38, 'program1'
               -- Rows with no Program_Name
        UNION ALL
        SELECT 39, NULL
               -- Continue with program names
        UNION ALL
        SELECT 40, 'program6'
        UNION ALL
        SELECT 41, 'program4'
               -- Rows with no Program_Name
        UNION ALL
        SELECT 42, NULL
        UNION ALL
        -- Continue with program names
        SELECT 43, 'program5'
        UNION ALL
        SELECT 44, 'program2'
               -- Rows with no Program_Name
        UNION ALL
        SELECT 45, NULL
               -- Continue with program names
        UNION ALL
        SELECT 46, 'program1'
        UNION ALL
        SELECT 47, 'program7'
               -- Rows with no Program_Name
        UNION ALL
        SELECT 48, NULL
               -- Continue with program names
        UNION ALL
        SELECT 49, 'program3'
        UNION ALL
        SELECT 50, 'program5'
     ),
     pivotdata AS (
        SELECT
            sequence_id,
            program_name,
            ROW_NUMBER() OVER (PARTITION BY CASE WHEN program_name IS NOT NULL THEN 0 ELSE 1 END ORDER BY sequence_id ASC) AS rno_asc,
            ROW_NUMBER() OVER (PARTITION BY CASE WHEN program_name IS NOT NULL THEN 0 ELSE 1 END ORDER BY sequence_id DESC) AS rno_desc,
            CASE
                WHEN
                    FIRST_VALUE(program_name) OVER (PARTITION BY CASE WHEN program_name IS NOT NULL THEN 0 ELSE 1 END ORDER BY sequence_id ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) =
                    LAST_VALUE(program_name) OVER (PARTITION BY CASE WHEN program_name IS NOT NULL THEN 0 ELSE 1 END ORDER BY sequence_id ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
                    THEN 1
            END initial_and_final_flag
        FROM
            rawdata
     )
-- SELECT * FROM pivotdata
SELECT
    DECODE(MAX(CASE WHEN program_name = 'program1' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program1,
    DECODE(MAX(CASE WHEN program_name = 'program2' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program2,
    DECODE(MAX(CASE WHEN program_name = 'program3' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program3,
    DECODE(MAX(CASE WHEN program_name = 'program4' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program4,
    DECODE(MAX(CASE WHEN program_name = 'program5' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program5,
    DECODE(MAX(CASE WHEN program_name = 'program6' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program6,
    DECODE(MAX(CASE WHEN program_name = 'program7' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program7,
    DECODE(MAX(CASE WHEN program_name = 'program8' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program8
FROM
    (
        SELECT
            program_name,
            CASE
                WHEN program_name IS NOT NULL and rno_asc = 1 AND initial_and_final_flag = 1
                    THEN 4
                WHEN program_name IS NOT NULL and rno_desc = 1
                    THEN 3
                WHEN program_name IS NOT NULL and rno_asc = 1
                    THEN 2
                WHEN program_name IS NOT NULL
                    THEN 1
            END AS appearance
        FROM
            pivotdata
    ) AS pivoteddata
Given the RawData dataset (which has sequence_id to determine the order in which different programs appeared in a timeline, and it few sequence can have null program_name where there are no program associated)

I would like to determine, for (program1, program2, .... programn) where they have occurred in the timeline.

Possible values could be:
initial: This indicates that program appeared in the very beginning.
final: This indicates that the program appeared in the very last.
in_between: This indicates that appeared in between and was not initial or final.
initial_and_final: Represents that the same program was there initially, and when it concluded (i.e. final).
single: Only one single program was present in the timeline, and no other programs.

With current state of data, output should look like:
+----------+----------+----------+----------+--------+--------+----------+--------+
|program1  |program2  |program3  |program4  |program5|program6|program7  |program8|
+----------+----------+----------+----------+--------+--------+----------+--------+
|in_between|in_between|in_between|in_between|final   |initial |in_between|null    |
+----------+----------+----------+----------+--------+--------+----------+--------+
But if we change program_name='program6' for sequence 50, the output should look like:
+----------+----------+----------+----------+----------+--------------------+----------+--------+
|program1  |program2  |program3  |program4  |program5  |program6            |program7  |program8|
+----------+----------+----------+----------+----------+--------------------+----------+--------+
|in_between|in_between|in_between|in_between|in_between|initial_and_final   |in_between|null    |
+----------+----------+----------+----------+----------+--------------------+----------+--------+
And if we change all the sequence to either null, or just have one program (let's say program6), then the output should be:
+----------+----------+----------+----------+----------+---------+----------+--------+
|program1  |program2  |program3  |program4  |program5  |program6 |program7  |program8|
+----------+----------+----------+----------+----------+---------+----------+--------+
|null      |null      |null      |null      |null      |single   |null      |null    |
+----------+----------+----------+----------+----------+---------+----------+--------+
Note:





  1. first or last sequence can also have NULL program_name.

Thanks.

[Updated on: Thu, 20 July 2023 10:59]

Report message to a moderator

Re: Finding initial, final, in_between, and initial_and_final events [message #687914 is a reply to message #687913] Thu, 20 July 2023 11:23 Go to previous messageGo to next message
mathguy
Messages: 107
Registered: January 2023
Senior Member
Three questions.

First: this is a perfect application of match_recognize, but that was only introduced in Oracle 12.1. What is your Oracle version? That should be included with all questions. Sorry if you provided it already and I missed it.

Second: it would be much simpler (no dynamic query) if the output was in just two columns, with one program name per row, not one per column.

Third: A program name may be "single" in one group, "final" in another, etc. Your output only shows one category for each program name. Shouldn't there be one for each group of non-null names?
Re: Finding initial, final, in_between, and initial_and_final events [message #687915 is a reply to message #687914] Thu, 20 July 2023 12:06 Go to previous messageGo to next message
manubatham20
Messages: 566
Registered: September 2010
Location: Seattle, WA, USA
Senior Member

mathguy wrote on Thu, 20 July 2023 11:23
Three questions.

First: this is a perfect application of match_recognize, but that was only introduced in Oracle 12.1. What is your Oracle version? That should be included with all questions. Sorry if you provided it already and I missed it.

Second: it would be much simpler (no dynamic query) if the output was in just two columns, with one program name per row, not one per column.

Third: A program name may be "single" in one group, "final" in another, etc. Your output only shows one category for each program name. Shouldn't there be one for each group of non-null names?
1. Oracle 10g.
2. Nope, having one program per row doesn't make sense, as this is just a very simple example of a complex overall use case. If I have 8 pre-defined programs, I prefer to keep it in columns (sparse data) over one row per program, and pivoting it again and again.
3. If program name is classified as single, it will take precedence over any other category.
Re: Finding initial, final, in_between, and initial_and_final events [message #687916 is a reply to message #687915] Thu, 20 July 2023 12:09 Go to previous messageGo to next message
manubatham20
Messages: 566
Registered: September 2010
Location: Seattle, WA, USA
Senior Member

To add to original post, in the next steps, I can count distinct program, and can overwrite the program status with single if count distinct program_name = 1. But I'm looking if there is any better way to achieve overall output.
Re: Finding initial, final, in_between, and initial_and_final events [message #687917 is a reply to message #687916] Thu, 20 July 2023 12:37 Go to previous messageGo to next message
mathguy
Messages: 107
Registered: January 2023
Senior Member
OK, I think I get it - I misunderstood part of the task.

So, you have a set of predetermined program names (including 'program8' which doesn't appear anywhere in the inputs). So the columns in the output are pre-determined, there is nothing "dynamic" about the requirement. This is good.

Also, you are looking at the entire timeline, from sequence = 1 to sequence = 50 - I thought you were splitting it into sub-sequences when a name was NULL, but re-reading the question (and your follow-up), that isn't what you are doing. So there should be only one classification for each program name.

I am working on possible solutions, I will post if I can come up with something simple. But in the meanwhile, just to make sure: you are indeed using Oracle, right? I ask only because in your RawData query, you are selecting values, but not showing where you are selecting them FROM (missing FROM clause). This works in some other RDBMS products, but not in Oracle; in Oracle you select "FROM DUAL" when you need such things.

Also, where do the predetermined program names come from? In my modeling I am adding a subquery in the WITH clause, but if in your use case they come from some other source, you will need to adapt.
Re: Finding initial, final, in_between, and initial_and_final events [message #687918 is a reply to message #687916] Thu, 20 July 2023 12:41 Go to previous messageGo to next message
mathguy
Messages: 107
Registered: January 2023
Senior Member
...and one more question, sorry.

You said the first or last program name in the sequence can be NULL. In that case, what is the definition of "initial" and "final"? Is it the program name for sequence_id = 1, resp. 50, or is it the first (or last) NON-NULL name in the sequence, if any?
Re: Finding initial, final, in_between, and initial_and_final events [message #687919 is a reply to message #687918] Thu, 20 July 2023 13:52 Go to previous messageGo to next message
mathguy
Messages: 107
Registered: January 2023
Senior Member
I think this will do what you need. I assumed that program1, for example, is classified as 'initial' only if it appears at sequence_id = 1 (and not if it is the first NON-NULL program name in the sequence, if the program name at sequence_id = 1 is NULL).

with
  raw_data as ( ...... )
, prep as (
    select case grouping(program_name) when 0 then program_name else 'global' end as program_name,
           grouping_id(program_name) as lvl,
           min(case grouping(program_name) when 1 then min(program_name) end) over () as min_pn,
           min(case grouping(program_name) when 1 then max(program_name) end) over () as max_pn,
           min(case grouping(program_name) when 1 then min(program_name)
                            keep (dense_rank first order by sequence_id) end) over () as first_pn,
           min(case grouping(program_name) when 1 then min(program_name)
                            keep (dense_rank  last order by sequence_id) end) over () as last_pn
    from   rawdata
    group  by rollup(program_name)
  )
select *
from   ( select program_name, min_pn, max_pn, first_pn, last_pn
         from   prep
         where  lvl = 0
       )
pivot  ( min( case when min_pn = max_pn then case when program_name = min_pn then 'single' end
                   when first_pn = last_pn and program_name = first_pn       then 'initial_and_final'
                   when program_name = first_pn then 'initial'
                   when program_name = last_pn  then 'final'
                   else                              'in_between'
              end )
         for program_name in ('program1' as program1, 'program2' as program2, 'program3' as program3, 'program4' as program4,
                              'program5' as program5, 'program6' as program6, 'program7' as program7, 'program8' as program8)
       )
;
Re: Finding initial, final, in_between, and initial_and_final events [message #687920 is a reply to message #687919] Thu, 20 July 2023 16:27 Go to previous messageGo to next message
manubatham20
Messages: 566
Registered: September 2010
Location: Seattle, WA, USA
Senior Member

Thanks mathguy. The definition of initial and final is first and last NOT NULL program_name in order of sequence_id. Also, I have moved on from Oracle, but try to write code in such a fashion which is generic, and can be ported easily if required. But since the beginning of my career this forum has been super helpful, and I turn to this forum looking for suggestions.

Could you please explain, how your code is efficient, and intuitive over what I've already?

[Updated on: Thu, 20 July 2023 17:48]

Report message to a moderator

Re: Finding initial, final, in_between, and initial_and_final events [message #687921 is a reply to message #687920] Thu, 20 July 2023 21:46 Go to previous message
mathguy
Messages: 107
Registered: January 2023
Senior Member
If "initial" and "final" must capture the first and last non-null program names, then you need a small modification of the code: In the subquery named PREP in the WITH clause, add the WHERE clause between FROM and GROUP BY:

...
from  rawdata
where program_name is not null
group by rollup(program_name)
...
This assumes that the program names in the PROGRAM_NAME column are always a subset of the "predetermined" names. If the "predetermined" names are from 'program1' to 'program8' but the PROGRAM_NAME column in the data may contain many other values, then a different WHERE clause is more efficient:

where program_name in ('program1', 'program2', ...... , 'program8')
The most time and resource consuming part of the query is the GROUP BY operation, so it is best to cut down the number of rows to which it applies as early as possible.

How is this solution more intuitive and more efficient... I don't know about "intuitive" - one's intuition depends on one's experience. In my query all the meaningful work is done in the subquery named PREP - an aggregation over the input data, using the ROLLUP option, to achieve two goals: first, to get all the distinct program names that are present in the data, and second, to get the global min, max, first and last program names. We only need these aggregates at the global level (over all the data, not for each group separately); this is what the ROLLUP thing does. Otherwise the query could be written without GROUP BY, with a SELECT DISTINCT; but that wouldn't get the global min, max, first and last.

The SELECT list of the PREP subquery includes several analytic functions - over the entire output of the aggregation - just to present the global min, max, first and last names on each row. Note that the aggregation is performed first, resulting in a small number of groups in the output ('program1' to 'program7' and NULL, plus the rollup row). The analytic functions apply to a very small number of rows, compared to the raw data.

Then the computations in PIVOT are trivial, and over small data. All the time in the query is in the aggregation in PREP (and I mean specifically in the aggregation, not in the analytic functions applied after the fact).

In your query, you get to essentially the same intermediate values to use in comparisons, but in a more indirect way. First you apply analytic functions, to the raw data itself (potentially a large number of rows). Then you apply an aggregation (pivoting) to the result - again over the same potentially large number of rows. Whether that matters, and by how much, is an empirical question. I tried both queries with 500 rows in rawdata, both finished too fast to see any difference. I tried again with 1.5 million rows in rawdata; now my query finishes about 10 times faster (under 0.3 seconds vs. 3 seconds for your query, on my system). Whether this matters depends on how much data you have in the first place; if it's of the order of 500 rows, then it doesn't matter.

[Updated on: Fri, 21 July 2023 00:27]

Report message to a moderator

Previous Topic: Duplicate Rows
Next Topic: Updating a Hijri date in oracle date column
Goto Forum:
  


Current Time: Sat Apr 27 11:30:19 CDT 2024